Abstract

This paper presents a novel spectral algorithm with additive clustering, designed to identify overlapping
communities in networks. The algorithm is based on geometric properties of the spectrum of the expected
adjacency matrix in a random graph model that we call the stochastic blockmodel with overlap (SBMO). An
adaptive version of the algorithm, which does not require knowledge of the number of hidden communities,
is proved to be consistent under the SBMO when the degrees in the graph are (slightly more than)
logarithmic. The algorithm is shown to perform well on simulated data and on real-world graphs with
known overlapping communities.


1 Introduction 

Many datasets (e.g., social networks, gene regulation networks) take the form of graphs whose structure
depends on some underlying communities. The commonly accepted definition of a community is that
nodes tend to be more densely connected within a community than with the rest of the graph. Communities
are often hidden in practice and recovering the community structure directly from the graph is a key
step in the analysis of these datasets. Spectral algorithms are popular methods for detecting communities
[Von Luxburg, 2007], which consist of two phases. First, a spectral embedding is built, where the n nodes
of the graph are projected onto some low-dimensional space generated by well-chosen eigenvectors of
some matrix related to the graph (e.g., the adjacency matrix or a Laplacian matrix). Then, a clustering
algorithm (e.g., k-means or k-median) is applied to the n embedded vectors to obtain a partition of the
nodes into communities.

It turns out that the structure of many real datasets is better explained by overlapping communities.
This is particularly true in social networks, in which the neighborhood of any given node is made of
several social circles that naturally overlap [McAuley and Leskovec, 2012]. Similarly, in co-authorship
networks, authors often belong to several scientific communities and in protein-protein interaction
networks, a given protein may belong to several protein complexes [Palla et al., 2005]. The communities
then do not form a partition of the graph and new algorithms need to be designed. This paper presents a
novel spectral algorithm, called spectral algorithm with additive clustering (SAAC). The algorithm consists
of a spectral embedding based on the adjacency matrix of the graph, coupled with an additive clustering
phase designed to find overlapping communities. The proposed algorithm does not require knowledge of
the number of communities present in the network, and can thus be qualified as adaptive.

Thomas Bonald and Marc Lelarge are members of the LINCS, Paris, France. See www.lincs.fr.
SAAC belongs to the family of model-based community detection methods, which are motivated by a
random graph model depending on some underlying set of communities. In the non-overlapping case,
spectral methods have been shown to perform well under the stochastic block model (SBM), introduced
by Holland and Leinhardt [Holland and Leinhardt, 1983]. Our algorithm is inspired by the simplest
possible extension of the SBM to overlapping communities, which we refer to as the stochastic blockmodel
with overlaps (SBMO). In the SBMO, each node is associated with a binary membership vector, indicating
all the communities to which the node belongs. We show that exploiting an additive structure in the
SBMO leads to an efficient method for the identification of overlapping communities. To support this
claim, we provide consistency guarantees when the graph is drawn under the SBMO, and we show
that SAAC exhibits state-of-the-art performance on real datasets for which ground-truth communities are
known.

The paper is structured as follows. In Section 2, we cast the problem of detecting overlapping 
communities into that of estimating a membership matrix in the SBMO model, introduced therein. In 
Section 3, we compare the SBMO with alternative random graph models proposed in the literature, and 
review the algorithms inspired by these models. In Section 4, we exhibit some properties of the spectrum 
of the adjacency matrix under SBMO, that motivate the new SAAC algorithm, introduced in Section 5, 
where we also formulate theoretical guarantees for an adaptive version of the algorithm. Finally, Section 
6 illustrates the performance of SAAC on both real and simulated data. 

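Notation. We denote by $\|x\|$ the Euclidean norm of a vector $x \in \mathbb{R}^d$. For any matrix $M \in \mathbb{R}^{n\times d}$, we let
$M_i$ denote its $i$-th row and $M_{\cdot,j}$ its $j$-th column. For any $S \subset \{1,\dots,d\}$, $|S|$ denotes its cardinality and
$\mathbf{1}_S \in \{0,1\}^{1\times d}$ is a row vector such that $(\mathbf{1}_S)_{1,j} = \mathbf{1}_{(j \in S)}$. The Frobenius norm of a matrix $M \in \mathbb{R}^{n\times d}$ is

$$\|M\|_F^2 = \sum_{i=1}^n \|M_i\|^2 = \sum_{j=1}^d \|M_{\cdot,j}\|^2 = \sum_{1\le i\le n,\ 1\le j\le d} M_{i,j}^2.$$

The spectral norm of a symmetric matrix $M \in \mathbb{R}^{d\times d}$ with eigenvalues $\lambda_1,\dots,\lambda_d$ is $\|M\| = \max_{i=1,\dots,d} |\lambda_i|$.
For $\sigma \in \mathfrak{S}_K$, we let $P_\sigma$ be the permutation matrix associated to $\sigma$, defined by $(P_\sigma)_{k,l} = \delta_{\sigma(k),l}$.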

2 The stochastic blockmodel with overlaps (SBMO) 

2.1 The model 

For any symmetric matrix $A \in [0,1]^{n\times n}$, let $\hat A$ be some random symmetric binary matrix whose entries
$(\hat A_{i,j})_{i\le j}$ are independent Bernoulli random variables with respective parameters $(A_{i,j})_{i\le j}$. Then $\hat A$ is the
adjacency matrix of an undirected random graph with expected adjacency matrix $A$. Throughout the paper, we
restrict the hat notation to variables that depend on this random graph. For example, the empirical degree
of node $i$ observed on the random graph and the expected degree of node $i$ are respectively denoted by

$$\hat d_i = \sum_{j=1}^n \hat A_{i,j} \quad\text{and}\quad d_i = \sum_{j=1}^n A_{i,j}.$$

Similarly, we write $\hat D = \mathrm{Diag}(\hat d_i)$, $D = \mathrm{Diag}(d_i)$, and

$$\hat d_{\max} = \max_{i=1,\dots,n} \hat d_i, \qquad d_{\max} = \max_{i=1,\dots,n} d_i.$$
The stochastic block model (SBM) with $n$ nodes and $K$ communities depends on some mapping
$k : \{1,\dots,n\} \to \{1,\dots,K\}$ that associates nodes to communities and on some symmetric community
connectivity matrix $B \in [0,1]^{K\times K}$. In this model, two nodes $i$ and $j$ are connected with probability

$$A_{i,j} = B_{k(i),k(j)} = B_{k(j),k(i)}.$$

Introducing a membership matrix $Z \in \{0,1\}^{n\times K}$ such that $Z_{i,l} = \mathbf{1}_{(k(i) = l)}$, the expected adjacency
matrix can be written

$$A = ZBZ^T.$$

The stochastic blockmodel with overlap (SBMO) is a slight extension of this model, in which $Z$ is
only assumed to be in $\{0,1\}^{n\times K}$ and $Z_i \neq 0$ for all $i$. Compared to the SBM, the rows of the membership
matrix $Z$ are no longer constrained to have only one non-zero entry. Since these $n$ rows give the
communities of the respective $n$ nodes of the graph, this means that each node can now belong to several
communities.


2.2 Performance metrics 


Given some adjacency matrix $\hat A$ drawn under the SBMO, our goal is to recover the underlying communities,
that is to build an estimate $\hat Z$ of the membership matrix $Z$, up to some permutation of its columns
(corresponding to a permutation of the community labels). We denote by $\hat K$ the estimate of the number
of communities ($K$ is in general unknown), so that $\hat Z \in \{0,1\}^{n\times\hat K}$.

We introduce two performance metrics for this problem. The first is related to the number of nodes
that are “well classified”, in the sense that there is no error in the estimate of their membership vector.
The objective is to minimize the number of misclassified nodes of an estimate $\hat Z$ of $Z$, defined by
$\mathrm{MisC}(\hat Z, Z) = n$ if $\hat K \neq K$ and

$$\mathrm{MisC}(\hat Z, Z) = \min_{\sigma\in\mathfrak{S}_K} \left|\left\{ i \in \{1,\dots,n\} : \exists k \in \{1,\dots,K\},\ \hat Z_{i,\sigma(k)} \neq Z_{i,k} \right\}\right|$$

otherwise. The second performance metric is the fraction of wrong predictions in the membership
matrix (again, up to a permutation of the community labels). We define the estimation error of $\hat Z$ as
$\mathrm{Error}(\hat Z, Z) = 1$ if $\hat K \neq K$ and otherwise by

$$\mathrm{Error}(\hat Z, Z) = \frac{1}{nK} \inf_{\sigma\in\mathfrak{S}_K} \|\hat Z P_\sigma - Z\|_F^2 \le \frac{\mathrm{MisC}(\hat Z, Z)}{n}.$$
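For small $K$, both metrics can be evaluated by brute force over the $K!$ column permutations. The following is a minimal illustrative sketch, not part of the original paper: the function name `misc_and_error` is hypothetical, and the membership matrices are assumed to be binary NumPy arrays with one row per node.

```python
from itertools import permutations

import numpy as np


def misc_and_error(Z_hat, Z):
    """Return (MisC, Error) for binary membership matrices, minimizing over column permutations."""
    Z_hat, Z = np.asarray(Z_hat), np.asarray(Z)
    n, K = Z.shape
    if Z_hat.shape[1] != K:  # wrong number of communities: worst-case values by definition
        return n, 1.0
    best_misc, best_err = n, 1.0
    for sigma in permutations(range(K)):
        # Entry (i, k) compares Z_hat[i, sigma(k)] with Z[i, k].
        wrong = Z_hat[:, list(sigma)] != Z
        best_misc = min(best_misc, int(wrong.any(axis=1).sum()))
        best_err = min(best_err, float(wrong.sum()) / (n * K))
    return best_misc, best_err
```

The enumeration costs $K!$ comparisons, so this is only meant for a handful of communities; the two minima are taken independently, as in the definitions.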


2.3 Identifiability 

The communities of a SBMO can only be recovered if the model is identifiable in that the equality
$Z'B'Z'^T = ZBZ^T$, for some integer $K'$ and matrices $Z' \in \{0,1\}^{n\times K'}$, $B' \in [0,1]^{K'\times K'}$, implies
$\mathrm{MisC}(Z', Z) = 0$ (and thus $K' = K$): two SBMO with the same expected adjacency matrices have the
same communities, up to a permutation of the community labels. In this section, we derive sufficient
conditions for identifiability.

Example 1. Consider the following SBMO with n nodes and 3 overlapping communities: 
where $a, b, c > 0$ and $\mathbf{1}$ (resp. $\mathbf{0}$) is a vector of length $n/3$ with all coordinates equal to 1 (resp. 0). This
SBMO is not identifiable since $ZBZ^T = Z'B'Z'^T$ with
Observe that this is a SBM with 3 non-overlapping communities.

In view of the above example, some additional assumptions are required to ensure identifiability. A 
first approach is to restrict the analysis to SBM. The following result is proved in Appendix A. 

Proposition 2. The SBMO is identifiable under the following assumptions:

(SBM1) for all $k \neq l$, the rows $B_k$ and $B_l$ are different;

(SBM2) for all $i = 1,\dots,n$, $\sum_{l=1}^K Z_{i,l} = 1$.

Assumption (SBM1) is the usual condition for identifiability of a SBM; the absence of overlap is
enforced by assumption (SBM2). Note that the SBM of Example 1 clearly satisfies both assumptions
and thus is identifiable: this is the only SBM with expected adjacency matrix $A = ZBZ^T$. One may
wonder whether the SBMO is identifiable if we impose an overlap, that is the existence of some node $i$
such that $\sum_{l=1}^K Z_{i,l} \ge 2$. The answer is negative, as shown by the following example.


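Example 1 (continued). Without loss of generality, we assume that $c \le \min(a, b)$. Consider the following
SBMO with $n$ nodes and 4 overlapping communities:

$$B'' = \begin{pmatrix} a+b-c & b-c & a-c & 0 \\ b-c & b & 0 & 0 \\ a-c & 0 & a & 0 \\ 0 & 0 & 0 & c \end{pmatrix}, \qquad Z'' = \begin{pmatrix} \mathbf{1} & \mathbf{0} & \mathbf{0} & \mathbf{1} \\ \mathbf{0} & \mathbf{1} & \mathbf{0} & \mathbf{1} \\ \mathbf{0} & \mathbf{0} & \mathbf{1} & \mathbf{1} \end{pmatrix}.$$

We have $ZBZ^T = Z''B''Z''^T$.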
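The non-identifiability can be checked numerically at the level of the three blocks of $n/3$ nodes. The sketch below is illustrative only. The four-community matrices are those of Example 1 (continued); the three-community pair `B3`, `Z3` (diagonal connectivity, pairwise overlaps) is an assumption chosen to be consistent with them, not a transcription of the displays of Example 1, and the numerical values of `a`, `b`, `c` are arbitrary.

```python
import numpy as np

a, b, c = 0.5, 0.4, 0.2  # illustrative values with 0 < c <= min(a, b)

# Four-community model of Example 1 (continued); each row of Z2 stands for a block of n/3 nodes.
B2 = np.array([[a + b - c, b - c, a - c, 0.0],
               [b - c,     b,     0.0,   0.0],
               [a - c,     0.0,   a,     0.0],
               [0.0,       0.0,   0.0,   c]])
Z2 = np.array([[1, 0, 0, 1],
               [0, 1, 0, 1],
               [0, 0, 1, 1]])
A_blocks = Z2 @ B2 @ Z2.T  # block-level expected adjacency matrix

# ASSUMPTION (hypothetical, for illustration): a diagonal three-community model with pairwise overlaps.
B3 = np.diag([a, b, c])
Z3 = np.array([[1, 1, 0],
               [0, 1, 1],
               [1, 0, 1]])

same_matrix = np.allclose(Z3 @ B3 @ Z3.T, A_blocks)
# Neither overlapping model contains a pure node (a row with a single 1).
no_pure_node = not (Z2.sum(axis=1) == 1).any() and not (Z3.sum(axis=1) == 1).any()
print(A_blocks)
print(same_matrix, no_pure_node)
```

Reading `A_blocks` itself as a connectivity matrix with the identity as block membership matrix gives a non-overlapping SBM with the same expected adjacency matrix.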


Thus some additional assumptions are required to make the SBMO identifiable. It is in fact sufficient 
that the community connectivity matrix is invertible and that each community contains at least one pure 
node (that is, belonging to this community only). The following result is proved in Appendix A. 

Theorem 3. The SBMO is identifiable under the following assumptions:

(SBMO1) $B$ is invertible;

(SBMO2) for each $k = 1,\dots,K$, there exists $i$ such that $Z_{i,k} = \sum_{l=1}^K Z_{i,l} = 1$.

Observe that the two SBMO of Example 1, with membership matrices $Z$ and $Z''$, violate (SBMO2).
Only the SBM is identifiable. In particular, if we generate a SBMO with 3 overlapping communities
based on the matrices $B$ and $Z$, our algorithm will return at best 3 non-overlapping communities
corresponding to the SBM with membership matrix $Z'$. To recover the model (1), some additional information
is required on the community structure. For instance, one may impose $K = 3$ and that each node belongs
to exactly two communities. Note that this last condition alone is not sufficient, in view of the third
model of Example 1.

Our choice for (SBMO1)-(SBMO2) is motivated by applications to social networks: homophily will make the
matrix $B$ diagonally dominant, hence invertible. In the rest of the paper, we assume that the identifiability
conditions (SBMO1) and (SBMO2) are satisfied.
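Both conditions are easy to test on a given pair $(B, Z)$. The sketch below is an illustration under the assumption that $Z$ is a binary NumPy array; the function name `check_identifiability` is hypothetical.

```python
import numpy as np


def check_identifiability(B, Z):
    """Return (sbmo1, sbmo2): B is invertible, and every community contains a pure node."""
    B, Z = np.asarray(B, dtype=float), np.asarray(Z)
    K = B.shape[0]
    sbmo1 = np.linalg.matrix_rank(B) == K
    pure_rows = Z[Z.sum(axis=1) == 1]                  # nodes that belong to exactly one community
    sbmo2 = bool((pure_rows.sum(axis=0) > 0).all())    # each community has at least one such node
    return bool(sbmo1), sbmo2
```

The rank test uses NumPy's default numerical tolerance, so a nearly singular $B$ may be reported as non-invertible.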
2.4 Subcommunity detection

Any SBMO with $K$ overlapping communities may be viewed as a SBM with up to $2^K$ non-overlapping
communities, corresponding to groups of nodes sharing exactly the same communities in the SBMO and
that we refer to as subcommunities.

Let $K'$ be the number of subcommunities in the SBMO:

$$K' = |\mathcal{T}|, \quad\text{where}\quad \mathcal{T} = \{ z \in \{0,1\}^{1\times K} : \exists i \in \{1,\dots,n\},\ Z_i = z \}.$$

The corresponding SBM has $K'$ communities indexed by $z \in \mathcal{T}$, with community connectivity matrix
$B'$ given by $B'_{y,z} = yBz^T$. The SBM of Example 1 can be derived from the first SBMO in this way, for
instance. More interestingly, it is easy to check that if the initial SBMO satisfies (SBMO1)-(SBMO2)
then the corresponding SBM satisfies (SBM1)-(SBM2).
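The reduction to subcommunities can be sketched in a few lines. This is an illustration only; the function name `subcommunities` is hypothetical and $Z$ is assumed to be a binary NumPy array.

```python
import numpy as np


def subcommunities(B, Z):
    """Return (T, labels, B_sub): distinct membership vectors, node-to-subcommunity map, SBM matrix."""
    Z = np.asarray(Z)
    # Rows of T are the distinct rows of Z; labels[i] is the index in T of the row Z[i].
    T, labels = np.unique(Z, axis=0, return_inverse=True)
    B_sub = T @ np.asarray(B, dtype=float) @ T.T   # entry (y, z) equals y B z^T
    return T, labels.ravel(), B_sub
```

Here `T.shape[0]` plays the role of $K'$ and `B_sub` the role of $B'$.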




Figure 1: Three overlapping communities of a SBMO (left) and the subcommunities of the associated SBM (right).

This suggests that community detection in the SBMO reduces to community detection in the corresponding
SBM, for which many efficient algorithms are known. However, the notion of performance
for a SBM is different from that for the underlying SBMO: the knowledge of the subcommunities
is not sufficient to recover the initial overlapping communities, that is to obtain an estimate $\hat Z$ such that
$\mathrm{MisC}(\hat Z, Z)$ is small. It is indeed necessary to map these subcommunities to elements of $\{0,1\}^K$, which
is not an easy task: first, the number of communities $K$ is unknown; second, assuming $K$ is known, there
are up to $2^K!$ such mappings so that a simple approach by enumeration is not feasible in general. Moreover,
the performance of clustering algorithms degrades rapidly with the number of communities so that
it is preferable to work directly on the $K$ overlapping communities rather than on the $K'$ subcommunities,
with $K'$ possibly as large as $2^K$.

Our algorithm directly detects the $K$ overlapping communities using the specific geometry of the
eigenvectors of the expected adjacency matrix, $A$. We provide conditions under which these geometric
properties hold for the observed adjacency matrix, $\hat A$, which guarantees the consistency of our algorithm:
the $K$ communities are recovered with probability tending to 1 in the limit of a large number of nodes $n$.
2.5 Scaling


To study the performance of our algorithm when the number of nodes $n$ grows, we introduce a degree
parameter $\alpha_n$ so that the expected adjacency matrix of a graph with $n$ nodes is in fact given by

$$A = \frac{\alpha_n}{n} ZBZ^T,$$

with $B \in [0,1]^{K\times K}$ independent of $n$ and $Z \in \{0,1\}^{n\times K}$. Although $Z$ depends on $n$, we do not make it
explicit in the notation. Observe that the expected degree of each node grows like $\alpha_n$, since

$$d_i = \alpha_n \left( \frac{1}{n} Z_i B Z^T \mathbf{1} \right),$$

where $\mathbf{1}$ is the vector of ones of dimension $n$.

We assume that the set of subcommunities $\mathcal{T}$ does not depend on $n$ and that for all $z \in \mathcal{T}$, there exists
a positive constant $\beta_z$ (independent of $n$) such that:

$$\frac{|\{i : Z_i = z\}|}{n} \to \beta_z. \qquad (2)$$

This implies the existence of positive constants $L_z$ and of a matrix $O \in \mathbb{R}^{K\times K}$ such that

$$\forall z \in \mathcal{T}, \quad \frac{1}{n} zBZ^T\mathbf{1} \to L_z \quad\text{and}\quad \frac{1}{n} Z^TZ \to O. \qquad (3)$$

One has $d_i \sim \alpha_n L_z$ for any $i$ such that $Z_i = z$. In the sequel, we assume that the graph is sparse in the
sense that $\alpha_n \to \infty$ with $\alpha_n/n \to 0$. Observe also that $O_{k,k}$ is the (limit) proportion of nodes that belong
to community $k$ while $O_{k,l}$ is the (limit) proportion of nodes that belong to communities $k$ and $l$, for any
$k \neq l$. Hence we refer to $O$ as the overlap matrix.

In the following, we will slightly abuse notation by writing $O = \frac{1}{n} Z^TZ$ and $d_i = \alpha_n L_z$ if $Z_i = z$,
although these equalities in fact hold only in the limit.
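The scaling can be made concrete by simulating a small graph. The sketch below is illustrative and not taken from the paper: the function name `sample_sbmo`, the community sizes, the matrix `B` and the value of `alpha` are all arbitrary assumptions.

```python
import numpy as np


def sample_sbmo(Z, B, alpha, rng):
    """Draw a graph with expected adjacency matrix A = (alpha / n) Z B Z^T; return (A, A_hat)."""
    Z = np.asarray(Z, dtype=float)
    n = Z.shape[0]
    A = (alpha / n) * Z @ B @ Z.T
    if A.max() > 1:
        raise ValueError("alpha is too large: some connection probability exceeds 1")
    upper = np.triu(rng.random((n, n)) < A)   # independent Bernoulli draws for i <= j
    A_hat = (upper | upper.T).astype(int)     # symmetrize to get an undirected graph
    return A, A_hat


rng = np.random.default_rng(0)
Z = np.repeat([[1, 0], [0, 1], [1, 1]], 100, axis=0)   # 100 pure nodes per community, 100 in both
B = np.array([[0.5, 0.05], [0.05, 0.5]])
A, A_hat = sample_sbmo(Z, B, alpha=60.0, rng=rng)
O = Z.T @ Z / Z.shape[0]            # finite-n overlap matrix
expected_degrees = A.sum(axis=1)    # equals alpha * (1/n) Z_i B Z^T 1 for each node i
```

With these choices the nodes that belong to both communities have twice the expected degree of the pure nodes, which illustrates why pure nodes tend to have smaller degrees.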


3 Related work 

Models. Several random graph models have been proposed in the literature to model networks with
overlapping communities. In these models, each node $i$ is characterized by some community membership
vector $Z_i$ that, unlike in the SBMO, is not always a binary vector. In the Mixed-Membership Stochastic
Blockmodel (MMSB) [Airoldi et al., 2008], introduced as the first model with overlaps, membership
vectors are probability vectors drawn from a Dirichlet distribution. In this model, conditionally on $Z_i$ and
$Z_j$, the probability that nodes $i$ and $j$ are connected is $Z_iBZ_j^T$ for some community connectivity matrix
$B$, just like in SBMO. However, the fact that $Z_i$ and $Z_j$ are probability vectors makes the model less
interpretable. In particular, the probability that two nodes are connected does not necessarily
increase with the number of communities that they have in common, as pointed out by Yang and Leskovec
[Yang and Leskovec, 2012], which contradicts a tendency empirically observed in social networks.

A first model that relies on binary membership vectors is the Overlapping Stochastic Block Model
(OSBM) [Latouche et al., 2011], in which two nodes $i, j$ are connected with probability
$\sigma(Z_iWZ_j^T + Z_iV + Z_jU + w)$, where $W \in \mathbb{R}^{K\times K}$, $U, V \in \mathbb{R}^K$, $w \in \mathbb{R}$, and $\sigma$ is the sigmoid function. Now the
probability of connectivity of two nodes increases with the number of communities shared, but the particular
form of the probability of connection makes the model hard to analyze. Given a community connectivity
matrix $B$, another natural way to build a random graph model based on binary membership vectors is
to assume that two nodes $i$ and $j$ are connected if any pair of communities $k, l$ to which these nodes
respectively belong can explain the connection. In other words, $i$ and $j$ are connected with probability
$1 - \prod_{k,l=1}^K (1 - B_{k,l})^{Z_{i,k}Z_{j,l}}$. Denoting by $Q$ the matrix with entries $Q_{k,l} = -\log(1 - B_{k,l})$, this probability
can be written $1 - \exp(-Z_iQZ_j^T) \simeq Z_iQZ_j^T$, where the approximation is valid for sparse networks. In
this case, the model is very close to the SBMO, with connectivity matrix $Q$. The Community-Affiliation
Graph Model (AGM) [Yang and Leskovec, 2012] is a particular case of this model in which $B$ is
diagonal. The SBMO with a diagonal connectivity matrix can be viewed as a particular instance of an
Additive Clustering model [Shepard and Arabie, 1979] and is also related to the ‘colored edges’ model
[Ball et al., 2011], in which $\hat A_{i,j}$ is drawn from a Poisson distribution with mean $\theta_i\theta_j^T$, where $\theta_i \in \mathbb{R}_+^{1\times K}$
is the (non-binary) membership vector of node $i$. Letting $\theta_i = Z_iB^{1/2}$ and approximating the Poisson
distribution by a Bernoulli distribution, we recover the SBMO.

The Overlapping Continuous Community Assignment Model (OCCAM), proposed by Zhang et al.
[Zhang et al., 2014], relies on overlapping communities but also on individual degree parameters, which
generalizes the degree-corrected stochastic blockmodel [Karrer and Newman, 2011]. In the OCCAM, a
degree parameter $\theta_i$ is associated to each node $i$. Letting $\Theta = \mathrm{Diag}(\theta_i) \in \mathbb{R}^{n\times n}$, the expected adjacency
matrix is $A = \Theta ZBZ^T\Theta$, with a membership matrix $Z \in \mathbb{R}_+^{n\times K}$. Identifiability of the model is proved
assuming that $B$ is positive definite, each row $Z_i$ satisfies $\|Z_i\| = 1$, and the degree parameters satisfy
$n^{-1}\sum_{i=1}^n \theta_i = 1$. The SBMO can be viewed as a particular instance of the OCCAM, for which we provide
new identifiability conditions that allow for binary membership vectors.

Algorithms. Several algorithmic methods have been proposed to identify overlapping community
structure in networks [Xie et al., 2013]. Among the model-based methods, which rely on the assumption
that the observed network is drawn under a random graph model, some are approximations of the
maximum likelihood or maximum a posteriori estimate of the membership vectors under one of the random
graph models discussed above. For example, under the MMSB or the OSBM the membership vectors are
assumed to be drawn from a probability (prior) distribution, and variational EM algorithms are proposed
to approximate the posterior distributions [Airoldi et al., 2008, Latouche et al., 2011]. However, there
is no proof of consistency of the proposed algorithms. In the MMSB, a different approach that uses
tensor power iteration is proposed in [Anandkumar et al., 2014] to compute an estimator derived using
the moments method, for which the first consistency results are provided.

The first occurrence of a spectral algorithm to find overlapping communities goes back to [Zhang
et al., 2007]. The proposed method is an adaptation of spectral clustering with the normalized Laplacian
(see e.g., [Newman, 2013]) with a fuzzy clustering algorithm in place of k-means, and its justification is
rather heuristic. Another spectral algorithm has been proposed by [Zhang et al., 2014], as an estimation
procedure for the (non-binary) membership matrix under the OCCAM. The spectral embedding is a
row-normalized version of $\hat U\hat\Lambda^{1/2} \in \mathbb{R}^{n\times K}$, with $\hat\Lambda$ the diagonal matrix containing the $K$ leading eigenvalues
of $\hat A$ and $\hat U$ the matrix of associated eigenvectors. The centroids obtained by a k-median clustering
algorithm are then used to estimate $Z$. This algorithm is proved to be consistent under the OCCAM,
when moreover degree parameters and membership vectors are drawn according to some distributions.
Similar assumptions have appeared before in the proof of consistency of some community detection
algorithms in the SBM or DC-SBM [Zhao et al., 2012]. Our consistency results are established for fixed
parameters of the model.
4 Spectral analysis of the adjacency matrix in the SBMO

In this section, we describe the spectral structure of the adjacency matrix in the SBMO. 

4.1 Expected adjacency matrix 

Let $\mathcal{Z}$ be the set of membership matrices that contain at least one pure node per community:

$$\mathcal{Z} = \Big\{ Z \in \{0,1\}^{n\times K} :\ \forall k = 1,\dots,K,\ \exists i :\ Z_{i,k} = \sum_{l=1}^K Z_{i,l} = 1 \Big\}.$$

From the identifiability conditions (SBMO1) and (SBMO2), $A = ZBZ^T$ is of rank $K$ (refer to the proof
of Theorem 3) and $Z$ belongs to $\mathcal{Z}$. Let $U \in \mathbb{R}^{n\times K}$ be a matrix whose columns $u_1,\dots,u_K \in \mathbb{R}^n$ are
normalized orthogonal eigenvectors associated to the $K$ non-zero eigenvalues of $A$. The structure of $U$
is described in the following proposition. Its first statement follows from the fact that the eigenvectors
$u_1,\dots,u_K$ form a basis of $\mathrm{Im}(A)$ and that $\mathrm{Im}(A) \subset \mathrm{Im}(Z)$. Its second statement is established in the
proof of Theorem 3.

Proposition 4. 1. There exists $X \in \mathbb{R}^{K\times K}$ such that $U = ZX$.

2. If $U = Z'X'$ for some $Z' \in \mathcal{Z}$, $X' \in \mathbb{R}^{K\times K}$, then there exists $\sigma \in \mathfrak{S}_K$ such that $Z = Z'P_\sigma$.

This decomposition reveals in particular an additive structure in $U$: each row $U_i$ is the sum of rows
corresponding to pure nodes associated to the communities to which node $i$ belongs. Fixing for each $k$ a
pure node $i_k$ in community $k$, one has indeed

$$\forall i, \quad U_i = \sum_{k=1}^K Z_{i,k}U_{i_k}. \qquad (4)$$

Proposition 5, proved in Appendix A, relates the eigenvectors of $A$ to those of a $K\times K$ matrix featuring
the overlap matrix $O$ introduced in Section 2.5. Note that for any $x \in \mathbb{R}^K$ we have $x^TOx = \|Zx\|^2/n$ so
that $O$ has the same rank as $Z$, equal to $K$. Hence $O$ is invertible and positive definite, thus the matrix
$O^{1/2}$ (resp. its inverse $O^{-1/2}$) is well defined.

Proposition 5. Let $\mu \neq 0$ and $M_0 = O^{1/2}BO^{1/2}$. The following statements are equivalent:

1. $u = Zx$ is an eigenvector of $A$ associated to $\alpha_n\mu$;

2. $O^{1/2}x$ is an eigenvector of $M_0$ associated to $\mu$.

In particular, the non-zero eigenvalues of $A$ are of the same order as $\alpha_n$.
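The correspondence between the two spectra can be checked numerically for finite $n$, using $O = \frac{1}{n}Z^TZ$. The sketch below is an illustration with arbitrary parameter values; the function name `core_matrix` is hypothetical.

```python
import numpy as np


def core_matrix(B, Z):
    """Return M0 = O^{1/2} B O^{1/2} with O = Z^T Z / n."""
    Z = np.asarray(Z, dtype=float)
    O = Z.T @ Z / Z.shape[0]
    w, V = np.linalg.eigh(O)                 # O is symmetric positive definite when Z has rank K
    O_half = (V * np.sqrt(w)) @ V.T          # symmetric square root of O
    return O_half @ B @ O_half


Z = np.repeat([[1, 0], [0, 1], [1, 1]], 20, axis=0)
B = np.array([[0.5, 0.05], [0.05, 0.3]])
alpha, n, K = 10.0, Z.shape[0], B.shape[0]
A = (alpha / n) * Z @ B @ Z.T
eig_A = np.linalg.eigvalsh(A)
top_A = np.sort(eig_A[np.argsort(np.abs(eig_A))[-K:]])            # the K eigenvalues of largest magnitude
top_M0 = np.sort(alpha * np.linalg.eigvalsh(core_matrix(B, Z)))   # alpha times the spectrum of M0
```

Up to floating-point error, `top_A` and `top_M0` coincide and all other eigenvalues of `A` vanish, as stated by the proposition.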

4.2 Observed adjacency matrix 

In practice, we observe the adjacency matrix $\hat A$, which is a noisy version of $A$. Our hope is that the
$K$ leading eigenvectors of $\hat A$ are not too far from the $K$ leading eigenvectors of $A$, so that in view of
Proposition 4, the solution in $Z'$ of the following optimization problem provides a good estimate of $Z$:

$$\min_{Z' \in \mathcal{Z},\ X' \in \mathbb{R}^{K\times K}} \|\hat U - Z'X'\|_F^2,$$

where $\hat U$ is the matrix of the $K$ normalized eigenvectors of $\hat A$ associated to the $K$ largest eigenvalues.

This hope is supported by the following result on the perturbation of the leading eigenvectors of the
adjacency matrix of any random graph, proved in Appendix D. In practice, the number of communities
$K$ is unknown and this result also provides an adaptive procedure to select the eigenvectors to use in the
spectral embedding. We denote by $\lambda_{\min}(A)$ the smallest absolute value of a non-zero eigenvalue of $A$.

Lemma 6. Let $\delta \in ]0, 1[$ and $\eta \in ]0, 1/2[$. Let $\hat U$ be a matrix formed by orthogonal eigenvectors of $\hat A$ with
an associated eigenvalue $\lambda$ that satisfy

$$|\lambda| > \sqrt{2(1+\eta)\,\hat d_{\max}\log(4n/\delta)}.$$

Let $\hat K$ be the number of such eigenvectors. Let $U$ be the matrix of the $K$ leading eigenvectors of $A$. If
i{2r} + 3){2 + r]) 


dn 


37/2 


-(f) 


and Wd)!> 

^max 




then with probability larger than $1 - \delta$, $\hat K = \mathrm{rank}(A)$ and there exists a matrix $\hat P \in O_K(\mathbb{R})$ such that


||f/-(7P||;<32(l + 


V 

rj + 2 


dn 


Amin(^)^ 



Under SBMO, we have $\lambda_{\min}(A) = \Theta(\alpha_n)$ by Proposition 5. As $d_{\max} = \Theta(\alpha_n)$, we need $\alpha_n/\log(n) \to +\infty$
to use Lemma 6 to prove that $\hat U$ is a good estimate of $U$. We give in the next section sufficient
conditions on the degree parameter $\alpha_n$ to obtain asymptotically exact recovery of the communities.
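The adaptive selection of eigenvectors suggested by Lemma 6 amounts to thresholding the spectrum of the observed adjacency matrix. The following sketch is an illustration only: the function name `select_eigenvectors` and the default values of `eta` and `delta` are assumptions, and a dense eigendecomposition is used for simplicity.

```python
import numpy as np


def select_eigenvectors(A_hat, eta=0.25, delta=0.1):
    """Keep the eigenvectors of A_hat whose eigenvalue exceeds the threshold of Lemma 6 in absolute value."""
    A_hat = np.asarray(A_hat, dtype=float)
    n = A_hat.shape[0]
    d_max_hat = A_hat.sum(axis=1).max()   # largest observed degree
    threshold = np.sqrt(2 * (1 + eta) * d_max_hat * np.log(4 * n / delta))
    eigvals, eigvecs = np.linalg.eigh(A_hat)
    keep = np.abs(eigvals) > threshold
    return eigvecs[:, keep], eigvals[keep], threshold
```

The number of columns of the returned matrix plays the role of $\hat K$. For large sparse graphs, an iterative sparse eigensolver would be a more practical choice than a full decomposition.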


5 The SAAC algorithm 

The spectral structure of the adjacency matrix suggests that $\hat Z$ defined below is a good estimate of the
membership matrix $Z$ in the SBMO:

$$(\mathcal{P}):\quad (\hat Z, \hat X) \in \operatorname*{argmin}_{Z' \in \mathcal{Z},\ X' \in \mathbb{R}^{K\times K}} \|\hat U - Z'X'\|_F^2, \qquad (5)$$

where $\hat U \in \mathbb{R}^{n\times K}$ is the matrix of the $K$ normalized leading eigenvectors of $\hat A$. In practice, solving $(\mathcal{P})$
is very hard, and the algorithm introduced in Section 5.1 solves a relaxation of $(\mathcal{P})$ in which $Z'$ is only
constrained to have binary entries, that is amenable to alternate minimization. In Section 5.2, we prove
that an adaptive version of the estimate $\hat Z$ given by (5) is consistent.

5.1 Description of the algorithm 

The spectral algorithm with additive clustering (SAAC) consists in first computing a matrix $\hat U \in \mathbb{R}^{n\times K}$
whose columns are normalized eigenvectors of $\hat A$ associated to the $K$ largest eigenvalues (in absolute
value), and then computing the solution of the following optimization problem:

$$(\mathcal{P})':\quad (\hat Z, \hat X) \in \operatorname*{argmin}_{Z' \in \{0,1\}^{n\times K},\ X' \in \mathbb{R}^{K\times K}} \|\hat U - Z'X'\|_F^2.$$

$(\mathcal{P})'$ is reminiscent of the (NP-hard) k-means problem, in which the same objective function is
minimized under the additional constraint that $\|Z'_i\| = 1$ for all $i$. The name of the algorithm highlights
the fact that, rather than finding a clustering of the rows of $\hat U$, the goal is to find $\hat Z$, containing pure
nodes $i_1,\dots,i_K$, that reveals the underlying additive structure of $\hat U$: for all $i$, $\hat U_i$ is not too far from
$\sum_{k=1}^K \hat Z_{i,k}\hat U_{i_k}$, in view of (4).

In practice, just like k-means, we propose to solve $(\mathcal{P})'$ by an alternate minimization over $Z'$ and
$X'$. The proposed implementation of the adaptive version of the algorithm, inspired by Theorem 8, is
presented as Algorithm 1. An upper bound $m$ on the maximum overlap $O_{\max} = \max\{\|z\|^2,\ z \in \mathcal{T}\}$ is
provided to limit the combinatorial complexity of the algorithm. If $K$ is known, the selection phase can
be removed, and one can use directly the matrix $\hat U \in \mathbb{R}^{n\times K}$ of $K$ leading eigenvectors. While heuristics do
exist for selecting the number of clusters in spectral clustering (e.g. [Von Luxburg, 2007, Zelnik-Manor
and Perona, 2004]), this thresholding procedure is supported by theory for networks drawn under SBMO.
It is reminiscent of the USVT algorithm of [Chatterjee, 2015], that can be used to estimate the expected
adjacency matrix in a SBM.


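Algorithm 1 Adaptive SAAC

Require: Parameters $\epsilon, r, \eta > 0$. Upper bound $m$ on the maximum overlap $O_{\max}$.
Require: $\hat A$, the adjacency matrix of the observed graph.

1: ▷ Selection of the eigenvectors
2: Form $\hat U$ a matrix whose columns are $\hat K$ eigenvectors of $\hat A$ associated to eigenvalues $\lambda$ satisfying
   $|\lambda| > \sqrt{2(1+\eta)\,\hat d_{\max}\log(4n^{1+r})}$
3: ▷ Initialization
4: $\hat Z = 0 \in \mathbb{R}^{n\times\hat K}$
5: $\hat X \in \mathbb{R}^{\hat K\times\hat K}$ initialized with k-means++ applied to $\hat U$, the first centroid being chosen at random
   among nodes with degree smaller than the median degree
6: Loss $= +\infty$
7: ▷ Alternating minimization
8: while (Loss $- \|\hat U - \hat Z\hat X\|_F^2 > \epsilon$) do
9:   Loss $= \|\hat U - \hat Z\hat X\|_F^2$
10:  Update membership vectors: $\forall i$, $\hat Z_i = \operatorname*{argmin}_{z \in \{0,1\}^{1\times\hat K} :\ 1 \le \|z\|^2 \le m} \|\hat U_i - z\hat X\|$.
11:  Update centroids: $\hat X = (\hat Z^T\hat Z)^{-1}\hat Z^T\hat U$.
12: end while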


Alternate minimization is guaranteed to converge, in a finite number of steps, towards a local
minimum of $\|\hat Z\hat X - \hat U\|_F^2$. However, the convergence is very sensitive to initialization. We use a k-means++
initialization (see [Arthur and Vassilvitskii, 2007]), which is a randomized procedure that picks as initial
centroids rows from $\hat U$ that should be far from each other. For the first centroid, we choose at random a
row in $\hat U$ corresponding to a node whose degree is smaller than the median degree in the network. We
do so because in the SBMO model, pure nodes tend to have smaller degrees and we expect the algorithm
to work well if the initial centroids are chosen not too far from rows in $\hat U$ corresponding to pure nodes.

Given $\hat Z$, as long as the matrix $\hat Z^T\hat Z$ is invertible, there is a closed form solution to the minimization
of $\|\hat Z X - \hat U\|_F^2$ in $X$, which is $\hat X = (\hat Z^T\hat Z)^{-1}\hat Z^T\hat U$. The fact that $\hat Z^T\hat Z$ is not invertible implies in
particular that $\hat Z$ does not contain a pure node for each community. If this happens, we re-initialize the
centroids, using again the k-means++ procedure.
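The alternating minimization described above can be sketched as follows, for a spectral embedding that has already been computed. This is a simplified illustration, not the authors' implementation: the function name `saac`, its signature, the stopping rule based on `eps` and `max_iter`, and the hand-written k-means++-style seeding are assumptions. The first centroid is drawn among nodes of degree at most the median (rather than strictly smaller) so that the candidate set is never empty.

```python
from itertools import combinations

import numpy as np


def saac(U_hat, degrees, m, eps=1e-9, max_iter=100, rng=None):
    """Alternating minimization of ||U_hat - Z X||_F^2 over binary Z (1 to m ones per row) and X."""
    rng = np.random.default_rng(rng)
    U_hat = np.asarray(U_hat, dtype=float)
    degrees = np.asarray(degrees)
    n, K = U_hat.shape
    # Candidate membership vectors: all z in {0,1}^K with 1 <= ||z||^2 <= m.
    cands = np.zeros((0, K))
    for size in range(1, m + 1):
        for support in combinations(range(K), size):
            z = np.zeros((1, K))
            z[0, list(support)] = 1.0
            cands = np.vstack([cands, z])

    def seed():
        # k-means++-style seeding; the first centroid is a node of degree at most the median.
        low = np.flatnonzero(degrees <= np.median(degrees))
        centers = [U_hat[rng.choice(low)]]
        while len(centers) < K:
            d2 = np.min([((U_hat - c) ** 2).sum(axis=1) for c in centers], axis=0)
            centers.append(U_hat[rng.choice(n, p=d2 / d2.sum())])
        return np.array(centers)

    X, loss = seed(), np.inf
    for _ in range(max_iter):
        # Membership update: each row picks the candidate z minimizing ||U_hat_i - z X||.
        dist = ((U_hat[:, None, :] - (cands @ X)[None, :, :]) ** 2).sum(axis=2)
        Z = cands[dist.argmin(axis=1)]
        if np.linalg.matrix_rank(Z) < K:            # Z^T Z is singular: re-initialize the centroids
            X, loss = seed(), np.inf
            continue
        X = np.linalg.solve(Z.T @ Z, Z.T @ U_hat)   # centroid update (least squares)
        new_loss = ((U_hat - Z @ X) ** 2).sum()
        if loss - new_loss <= eps:
            break
        loss = new_loss
    return Z.astype(int), X
```

The membership update enumerates all candidate vectors, whose number grows like $K^m$, which is why the bound $m$ on the maximum overlap matters in practice.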
5.2 Consistency of an adaptive estimator

We give in Theorem 8 theoretical properties for a slight variant of the estimate $\hat Z$ in (5), that is a solution
of the optimization problem $(\mathcal{P}_\epsilon)$ defined therein, which features the set of membership matrices for which
the proportion of pure nodes in each community is larger than $\epsilon$:

$$\mathcal{Z}_\epsilon = \Big\{ Z' \in \{0,1\}^{n\times K} :\ \forall k \in \{1,\dots,K\},\ \frac{|\{i : Z'_i = \mathbf{1}_{\{k\}}\}|}{n} > \epsilon \Big\}.$$

Recall the notation introduced in (2) and (3). We assume that $\epsilon$ is smaller than the smallest proportion of
pure nodes (in the limit), given by $\min_k \beta_{\mathbf{1}_{\{k\}}}$, and let $L_{\max} = \max_z L_z$.

The estimator analyzed is adaptive, for it relies on an estimate $\hat K$ of the number of communities,
and on $\hat{\mathcal{Z}}_\epsilon = \mathcal{Z}_\epsilon(\hat K)$. We establish its consistency for any fixed matrices $B$ and $Z$ satisfying (SBMO1)
and (SBMO2). It is to be noted that while the consistency result for the OCCAM algorithm [Zhang
et al., 2014] applies to moderately dense graphs ($\alpha_n$ has to be of order $n^a$ for some $a > 0$), our result
handles relatively sparse graphs, in which $\alpha_n$ is of order $(\log(n))^{1+c}$ for some $c > 0$. Our result involves
constants defined below, that are related to the overlap matrix $O$ and to the matrix $M_0$ introduced
in Proposition 5.

Definition 7. The core matrix is the $K\times K$ symmetric matrix $M_0 := O^{1/2}BO^{1/2}$. We let

$$\mu_0 := \min\{|\lambda| : \lambda \neq 0 \text{ is an eigenvalue of } M_0\}, \qquad d_0 := \min_{z \in \{-1,0,1,2\}^{1\times K},\ z \neq 0} \|zO^{-1/2}\| > 0.$$

Note that $d_0$ is positive as seen by the following argument: if $d_0 = 0$, then there would exist a non-trivial linear
combination of the rows of $O^{-1/2}$ which is zero; this is impossible because the matrix $O^{-1/2}$ is invertible.
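Both constants can be computed directly for small $K$ by enumerating the $4^K - 1$ non-zero vectors $z$. The sketch below is an illustration; the function name `core_constants` and the tolerance `tol` used to discard numerically zero eigenvalues are assumptions.

```python
from itertools import product

import numpy as np


def core_constants(B, O, tol=1e-12):
    """Return (mu_0, d_0) of Definition 7 for a connectivity matrix B and an overlap matrix O."""
    w, V = np.linalg.eigh(np.asarray(O, dtype=float))
    O_half = (V * np.sqrt(w)) @ V.T          # O^{1/2}
    O_inv_half = (V / np.sqrt(w)) @ V.T      # O^{-1/2}
    eig_M0 = np.linalg.eigvalsh(O_half @ np.asarray(B, dtype=float) @ O_half)
    mu_0 = np.abs(eig_M0[np.abs(eig_M0) > tol]).min()
    K = len(w)
    # Minimum of ||z O^{-1/2}|| over all non-zero z with entries in {-1, 0, 1, 2}.
    d_0 = min(np.linalg.norm(np.array(z) @ O_inv_half)
              for z in product((-1, 0, 1, 2), repeat=K) if any(z))
    return mu_0, d_0
```

The enumeration is exponential in $K$, so this is only meant to give a feel for the constants on small examples.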

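Theorem 8. Let $\eta \in ]0, 1/2[$ and $r > 0$. Let $\hat U$ be a matrix whose columns are orthogonal eigenvectors of
$\hat A$ associated to an eigenvalue $\lambda$ satisfying

$$|\lambda| > \sqrt{2(1+\eta)\,\hat d_{\max}\log(4n^{1+r})}.$$

Let $\hat K$ be the number of such eigenvectors. Let

$$(\mathcal{P}_\epsilon):\quad (\hat Z, \hat X) \in \operatorname*{argmin}_{Z' \in \mathcal{Z}_\epsilon(\hat K),\ X' \in \mathbb{R}^{\hat K\times\hat K}} \|Z'X' - \hat U\|_F^2.$$

Assume that $\alpha_n \to \infty$ and $\epsilon < \min_k \beta_{\mathbf{1}_{\{k\}}}$. There exists some constant $c_1 > 0$ such that, if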


an > max 


4(2r? + 3)(2 + 7?) 
3f?^T/max 




1 + -JTTf] 


tI 


log (4n^^^) 


then, for $n$ large enough, with probability larger than $1 - n^{-r}$, $\hat K = K$ and

$$\frac{\mathrm{MisC}(\hat Z, Z)}{n} \le c_1\, \frac{K^2 L_{\max}}{d_0^2\,\mu_0^2}\, \frac{\log(4n^{1+r})}{\alpha_n}.$$

In particular, assuming that $\alpha_n/\log(n) \to \infty$ when $n$ goes to infinity, it can be shown (using the
Borel-Cantelli Lemma) that the estimation procedure described in Theorem 8 with a parameter $r > 2$ is
consistent, in the sense that it satisfies

$$\frac{\mathrm{MisC}(\hat Z, Z)}{n} \xrightarrow[n\to\infty]{\text{a.s.}} 0.$$

Theoretical guarantees for other estimates. Theorem 8 leads to an upper bound on the estimation
error of $\hat Z$, a solution to $(\mathcal{P}_\epsilon)$. In some cases, it is also possible to prove directly that the solution of
$(\mathcal{P})'$ leads to a consistent estimate of $Z$. This is the case for instance in an identifiable SBMO with two
overlapping communities or with three communities with pairwise overlaps.

If $K$ is known, tighter results can be obtained for non-adaptive procedures in which $\hat U \in \mathbb{R}^{n\times\hat K}$ is
replaced by $\hat U \in \mathbb{R}^{n\times K}$. These results are stated in Appendix C, where two non-adaptive estimation
procedures are shown to be consistent under the (looser) condition $\alpha_n > c_0\log(n)$ for some constant $c_0$
stated therein.

5.3 Proof of Theorem 8 

Let $U \in \mathbb{R}^{n\times K}$ be a matrix whose columns are $K$ independent normalized eigenvectors of $A$ associated
to the non-zero eigenvalues. The proof strongly relies on the following decomposition of $U$, that is a
consequence of Proposition 5.

Lemma 9. There exists a matrix $V \in O_K(\mathbb{R})$ of eigenvectors of $M_0 = O^{1/2}BO^{1/2}$ such that $U = ZX$
with $X = n^{-1/2}O^{-1/2}V$.

We state below a crucial result characterizing the sensitivity to noise of the decomposition $U = ZX$
of Lemma 9, in terms of the quantity $d_0$ introduced in Definition 7. The proof of this key result is given
in Appendix B: it builds on the fact that $d_0$ provides a lower bound on the norm of some particular linear
combinations of the rows of $X$: indeed, one has

$$\forall z \in \{-1,0,1,2\}^{1\times K}\setminus\{0\}, \quad \|zX\| \ge d_0/\sqrt{n}.$$

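Lemma 10. (Robustness to noise) Let $Z' \in \mathcal{Z}$, $X' \in \mathbb{R}^{K \times K}$ and $\mathcal{N} \subset \{1,\dots,n\}$. Assume that

1. $\forall i \in \mathcal{N}$, $\|Z'_i X' - Z_i X\| \le \frac{d_0}{4K\sqrt{n}}$,

2. there exists $(i_1,\dots,i_K), (j_1,\dots,j_K) \in \mathcal{N}^K$: $\forall k \in [1,K]$, $Z_{i_k} = Z'_{j_k} = 1_{\{k\}}$.

Then there exists a permutation matrix $P_\sigma$ such that for all $i \in \mathcal{N}$, $Z_i = (Z'P_\sigma)_i$.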

Let $\hat{U}$ be the matrix defined in Theorem 8. We first note that Lemma 6 can be rephrased in terms of the degree parameter $\alpha_n$. Indeed, from Proposition 5, $\lambda_{\min}(A) = \alpha_n \mu_0$, with $\mu_0$ in Definition 7, and $d_{\max} = \alpha_n L_{\max}$, with

\[
L_{\max} = \max_{i=1,\dots,n} \left( \frac{1}{n} Z_i B Z^T \mathbf{1}_{n,1} \right).
\]

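From Lemma 6, letting

\[
C_0(\eta) = \max\left[ \frac{4(2\eta+3)(2+\eta)}{3\eta^2 L_{\max}} \; ; \; 2\left(1 + \frac{\eta}{2+\eta}\right)\left(1 + \sqrt{1+\eta}\right)^2 \frac{L_{\max}}{\mu_0^2} \right],
\]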


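if $\alpha_n \ge C_0(\eta) \log(4n^{1+r})$, then with probability larger than $1 - n^{-r}$, $\hat{K} = K$ and there exists a rotation $\hat{P} \in \mathcal{O}_K(\mathbb{R})$ such that

\[
\|\hat{U} - U\hat{P}\|_F^2 \le 32\left(1 + \frac{\eta}{\eta+2}\right) \frac{L_{\max}}{\mu_0^2} \, \frac{\log(4n^{1+r})}{\alpha_n}. \tag{6}
\]

In the sequel, we assume that $\hat{K} = K$ and that this inequality holds with a rotation $\hat{P}$. The estimate $(\hat{Z}, \hat{X})$ is then defined by

\[
(\hat{Z}, \hat{X}) \in \operatorname{argmin} \|Z'X' - \hat{U}\|_F^2,
\]

where the minimum is taken over the pairs $(Z', X')$ that are feasible for $(\mathcal{P}_\epsilon)$.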


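Introducing $\hat{X}_1 := \hat{X}\hat{P}^{-1}$, we first show that $\hat{Z}\hat{X}_1$ is a good estimate of $U$ provided that $\hat{U}$ is close to $U\hat{P}$:

\[
\|\hat{Z}\hat{X}_1 - U\|_F \le 2\|\hat{U} - U\hat{P}\|_F. \tag{7}
\]

This inequality can be obtained in the following way. Let $X, Z$ be defined in Lemma 9. As $Z \in \mathcal{Z}_\epsilon(K)$ (for $\epsilon$ smaller than the minimal proportion of pure nodes in a community), the pair $(Z, X\hat{P})$ is feasible and, by definition of $\hat{Z}$ and $\hat{X}$,

\[
\|\hat{Z}\hat{X} - \hat{U}\|_F \le \|ZX\hat{P} - \hat{U}\|_F = \|U\hat{P} - \hat{U}\|_F.
\]

Then, as the Frobenius norm is invariant under right multiplication by a rotation, one has

\[
\|\hat{Z}\hat{X}\hat{P}^{-1} - U\|_F \le \|\hat{Z}\hat{X}\hat{P}^{-1} - \hat{U}\hat{P}^{-1}\|_F + \|\hat{U}\hat{P}^{-1} - U\|_F = \|\hat{Z}\hat{X} - \hat{U}\|_F + \|\hat{U} - U\hat{P}\|_F \le 2\|\hat{U} - U\hat{P}\|_F.
\]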

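We now introduce the set of nodes

\[
\mathcal{N}_n := \left\{ i : \|\hat{Z}_i\hat{X}_1 - Z_i X\| \le \frac{d_0}{4K\sqrt{n}} \right\}
\]

and show that assumptions 1. and 2. in Lemma 10 are satisfied for this set and the pair $(\hat{Z}, \hat{X}_1)$, if

\[
\frac{64K^2}{d_0^2}\|\hat{U} - U\hat{P}\|_F^2 \le \epsilon. \tag{8}
\]

Assumption 1. is satisfied by definition of $\mathcal{N}_n$. We now show that, as required by assumption 2., $\mathcal{N}_n$ contains one pure node in each community relatively to $Z$ and $\hat{Z}$.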

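First, using notably (7), the cardinality of $\mathcal{N}_n^c$ is upper bounded as

\[
|\mathcal{N}_n^c| \le \frac{16K^2 n}{d_0^2} \sum_{i=1}^n \|\hat{Z}_i\hat{X}_1 - U_i\|^2 = \frac{16K^2 n}{d_0^2}\|\hat{Z}\hat{X}_1 - U\|_F^2 \le \frac{64K^2 n}{d_0^2}\|\hat{U} - U\hat{P}\|_F^2.
\]

Thus, if (8) holds, $|\mathcal{N}_n^c| \le \epsilon n$. As $\hat{Z} \in \mathcal{Z}_\epsilon(K)$, for all $k \le K$ the cardinality of the set of nodes $i$ such that $\hat{Z}_i = 1_{\{k\}}$ is strictly larger than $\epsilon n$, hence this set cannot be included in $\mathcal{N}_n^c$. Thus, for all $k$, there exists $j_k \in \mathcal{N}_n$ such that $\hat{Z}_{j_k} = 1_{\{k\}}$. As $\epsilon$ is smaller than the minimal proportion of pure nodes in a community, by a similar argument the set of nodes $i$ such that $Z_i = 1_{\{k\}}$ cannot be included in $\mathcal{N}_n^c$ either. Thus for all $k$, there exists $i_k \in \mathcal{N}_n$ such that $Z_{i_k} = 1_{\{k\}}$.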

Hence Lemma 10 can be applied and there exists $\sigma \in \mathfrak{S}_K$ such that $\forall i \in \mathcal{N}_n$, $Z_i = (\hat{Z}P_\sigma)_i$: up to a permutation of the community labels, all the communities of nodes in $\mathcal{N}_n$ are recovered. Using (6), this implies that whenever $\alpha_n \ge C_0(\eta)\log(4n^{1+r})$, with probability larger than $1 - n^{-r}$,

\[
\frac{\mathrm{MisC}(\hat{Z},Z)}{n} \le \frac{|\mathcal{N}_n^c|}{n} \le \frac{2048 K^2 L_{\max}}{d_0^2\mu_0^2}\left(1 + \frac{\eta}{\eta+2}\right)\frac{\log(4n^{1+r})}{\alpha_n},
\]

provided that the final upper bound is smaller than $\epsilon$ (which implies that condition (8) is satisfied), which is the case for $n$ large enough.

6 Experimental results


We mostly use the estimation error to evaluate the quality of an estimate $\hat{Z}$ of some membership matrix $Z$, that we recall is defined by

\[
\mathrm{Error}(\hat{Z}, Z) = \frac{1}{nK} \min_{\sigma \in \mathfrak{S}_K} \|\hat{Z}P_\sigma - Z\|_F^2.
\]

This error can be split into two kinds of errors: entries that are ones in $\hat{Z}P_{\sigma^*}$ (where $\sigma^*$ realizes the minimum above) but zeros in $Z$, called false positives, and entries that are zeros in $\hat{Z}P_{\sigma^*}$ but ones in $Z$, called false negatives. We define the false positive and false negative rates as


FP(Z,Z) 


1(f) ^ and Zi^]^ 0| 

\{i,k) ■■ Zi^k = 1| 


FN(Z,Z) 


|(f, A:) ■ 0 and Z^^k 1| 

\{i,k) : Zi^k = 0| 
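The estimation error and the two kinds of erroneous entries described above can be computed by brute force when $K$ is small. The following is a minimal sketch, not the authors' code: the function names `estimation_error` and `false_entry_counts` are hypothetical, the search enumerates all $K!$ column permutations, and only raw counts of false positive and false negative entries are returned (no normalization into rates is assumed).

```python
from itertools import permutations

import numpy as np


def estimation_error(Z_hat, Z):
    """Return (error, best_perm) for two binary n x K membership matrices.

    The error is the fraction of the n*K entries on which Z_hat, with its
    columns reordered by best_perm, disagrees with Z. For 0/1 matrices this
    count equals the squared Frobenius norm of the difference.
    """
    n, K = Z.shape
    best_err, best_perm = None, None
    for perm in permutations(range(K)):  # brute force: K! candidates
        err = np.sum(Z_hat[:, list(perm)] != Z) / (n * K)
        if best_err is None or err < best_err:
            best_err, best_perm = err, perm
    return best_err, best_perm


def false_entry_counts(Z_hat, Z):
    """Count false positive and false negative entries after alignment."""
    _, perm = estimation_error(Z_hat, Z)
    aligned = Z_hat[:, list(perm)]
    false_pos = int(np.sum((aligned == 1) & (Z == 0)))  # ones in estimate, zeros in truth
    false_neg = int(np.sum((aligned == 0) & (Z == 1)))  # zeros in estimate, ones in truth
    return false_pos, false_neg
```

The enumeration is only practical for a handful of communities; for larger $K$ one would presumably replace it by an assignment solver, which is outside the scope of this sketch.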


An extension of the normalized variation of information (NVI) introduced by [Lancichinetti et al., 2009] is also used as a measure of performance in several papers. This indicator compares the distribution of two random vectors $X = (X_1,\dots,X_K)$ and $Y = (Y_1,\dots,Y_K)$ in $\{0,1\}^K$ associated to $Z$ and $\hat{Z}$ respectively, such that the joint distribution of any two marginals is given by

\[
\mathbb{P}(X_k = x_k, Y_l = y_l) = \frac{|\{i : Z_{i,k} = x_k \text{ and } \hat{Z}_{i,l} = y_l\}|}{n}.
\]

The NVI is defined by

\[
\mathrm{NVI}(\hat{Z}, Z) = 1 - \min_{\sigma \in \mathfrak{S}_K} \frac{1}{2K} \sum_{k=1}^K \left[ \frac{H(X_k | Y_{\sigma(k)})}{H(X_k)} + \frac{H(Y_{\sigma(k)} | X_k)}{H(Y_{\sigma(k)})} \right],
\]

where $H(V)$ and $H(V|W)$ denote respectively the entropy of a random variable $V$ and the conditional entropy of $V$ given $W$ (see e.g. [Cover and Thomas, 2006] for definition). Unlike the other performance measures that we consider, the NVI should be maximized.

Our analysis shows that for a graph drawn under the SBMO the error of SAAC goes to zero almost surely when the number of nodes $n$ grows large, and the degrees are large enough, more precisely (slightly more than) logarithmic in $n$. We illustrate this fact on simulated data, and compare SAAC to other (spectral) algorithms on simulated data and on two kinds of real-world graphs with overlapping communities: ego networks and co-authorship networks.


6.1 Simulated data 

We compare SAAC to (normalized) spectral clustering using the adjacency matrix, referred to as SC, and to the spectral algorithm proposed by [Zhang et al., 2014] to fit the random graph model called OCCAM. We refer to this algorithm as the OCCAM spectral method.

First, we generate networks from SBMO models with n = 500 nodes, K = 5 communities, = 
log^'^(n), B = Diag([5,4,3,3,3]) and Z drawn at random in such a way that each community has a 
fraction of pure nodes equal to $p/K$ for some parameter $p$ and the size of the maximum overlap $O_{\max}$ is smaller than 3. The left part of Figure 2 shows the error of each method as a function of $p$, averaged over 100 networks. SAAC significantly outperforms OCCAM, especially when there is a large overlap between communities. As expected, both methods outperform SC, which is designed to handle non-overlapping communities, except when the amount of overlap gets really small.

To have a fairer comparison with the OCCAM spectral algorithm, we then draw networks under a modified version of the model used before, in which the rows of $Z$ are normalized, so that for all $i$, one has $\|Z_i\| = 1$: this random graph model is a particular instance of the OCCAM. Results are displayed on the right part of Figure 2. The OCCAM spectral algorithm, designed to fit this model, performs most of the time slightly better than the other methods, but the gap between OCCAM and SAAC is very narrow.
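The simulation setup above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' generator: the helper names `random_membership` and `sample_sbmo` are hypothetical, mixed nodes are assigned to between 2 and `max_overlap` communities chosen uniformly at random, self-loops are excluded, and the degree parameter `alpha_n` is left as a free argument.

```python
import numpy as np


def random_membership(n, K, p, max_overlap, rng):
    """Binary n x K membership matrix with a fraction p/K of pure nodes per community.

    Assumption of this sketch: every remaining node joins between 2 and
    max_overlap communities chosen uniformly at random.
    """
    Z = np.zeros((n, K), dtype=int)
    n_pure = int(n * p / K)
    idx = 0
    for k in range(K):
        Z[idx:idx + n_pure, k] = 1  # pure nodes of community k
        idx += n_pure
    for i in range(idx, n):
        size = rng.integers(2, max_overlap + 1)  # overlap size in [2, max_overlap]
        Z[i, rng.choice(K, size=size, replace=False)] = 1
    return Z


def sample_sbmo(Z, B, alpha_n, rng):
    """Return (A, A_hat): expected adjacency (alpha_n/n) Z B Z^T and one sampled graph."""
    n = Z.shape[0]
    A = (alpha_n / n) * Z @ B @ Z.T
    if A.max() > 1:
        raise ValueError("edge probabilities exceed 1; decrease alpha_n")
    upper = np.triu(rng.random((n, n)) < A, k=1)  # independent edges for i < j
    A_hat = (upper | upper.T).astype(int)         # symmetric, no self-loops
    return A, A_hat
```

For the row-normalized variant of the second experiment, one would divide each row of `Z` by its Euclidean norm before forming the expected adjacency matrix.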




Figure 2: Comparison of SC, SAAC and the OCCAM spectral algorithm under instances of SBMO (left) and OCCAM (right) random graph models.


6.2 Real networks 

[Zhang et al., 2014] compare the performance of the OCCAM spectral algorithm to that of other algorithms on both simulated data and real data, namely ego networks [McAuley and Leskovec, 2012]. Nodes in an ego network are the set of friends of a given central node in a social network, and edges indicate friendship relationships between these nodes. We first apply SAAC on networks from this dataset, that naturally contain overlap. To do so, we use the pre-processing of the networks described in [Zhang et al., 2014], that especially keeps communities if they have at least a fraction of pure nodes equal to 10% of the network. Additionally, because the focus is on overlapping communities, we keep only networks for which the fraction of nodes that belong to more than one community is larger than 1%. This leads us to keep only 6 (out of 10) Facebook networks (labeled 0, 414, 686, 698, 1912 and 3437 in the dataset) and 26 (out of 133) Google Plus networks from the original dataset.

Table 1 presents the characteristics of the Facebook networks used, and the performance of SC, SAAC and OCCAM, averaged over the 6 networks used (with the standard deviation added). For each algorithm, the estimation error is displayed but also the fraction of false positive (FP) and false negative (FN) entries in $\hat{Z}$, and the extended normalized variation of information (NVI). The parameter $c$ corresponds to the average number of communities per node, $c = \sum_{i,k} Z_{i,k}/n$, and $O_{\max}$ is the maximum size of an overlap. OCCAM and SAAC have comparable performance, but there is no significant improvement over spectral clustering. This can be explained by the fact that the amount of overlap ($c$) is very small in this dataset. The same tendency was observed on the Google Plus networks.

| | n | K | c | O_max | FP | FN | Error | NVI |
|---|---|---|---|---|---|---|---|---|
| SC | 190 (173) | 3.17 (1.07) | 1.09 (0.06) | 2.17 (0.37) | 0.200 (0.110) | 0.139 (0.107) | 0.120 (0.083) | 0.556 (0.256) |
| OCCAM | 190 (173) | 3.17 (1.07) | 1.09 (0.06) | 2.17 (0.37) | 0.176 (0.176) | 0.113 (0.084) | 0.127 (0.102) | 0.556 (0.280) |
| SAAC | 190 (173) | 3.17 (1.07) | 1.09 (0.06) | 2.17 (0.37) | 0.125 (0.067) | 0.101 (0.062) | 0.102 (0.049) | 0.544 (0.217) |

Table 1: Spectral algorithms recovering overlapping friend circles in ego-networks from Facebook.

We then try SAAC on co-authorship networks built from DBLP in the following way. Nodes correspond to authors and we fix as ground-truth communities some conferences (or groups of conferences): an author belongs to some community if she/he has published at least one paper in the corresponding conference(s). We then build the network of authors by putting an edge between authors if they have published a paper together in one of the considered conferences. We present results for some conferences with machine learning in their scopes: ICML, NIPS, and two theory-oriented conferences that we group together, ALT and COLT. We compare the three spectral algorithms in terms of estimation error and false positive / false negative rates. Results are presented in Table 2, in which the estimated amount of overlap $\hat{c} = \sum_{i,k} \hat{Z}_{i,k}/n$ is also reported. In this case, SAAC and OCCAM significantly outperform SC, although the error is relatively high. The amount of overlap is under-estimated by both algorithms, but SAAC appears to recover slightly more overlapping nodes. The difficulty of recovering communities in that case may come from the fact that the networks constructed are very sparse. Indeed, we propose in Appendix E preliminary experiments that illustrate the difficulty of recovering overlapping communities in very sparse networks.
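The construction of the co-authorship network described above can be sketched as follows. This is only an illustration under assumptions: the input format (a list of `(venue, authors)` pairs), the function names `build_coauthorship` and `average_overlap`, and the toy records are hypothetical and do not reflect the actual DBLP processing used for the experiments.

```python
from itertools import combinations


def build_coauthorship(papers, communities):
    """Build ground-truth memberships and co-authorship edges.

    papers: iterable of (venue, list_of_authors).
    communities: dict mapping a community name to a set of venues.
    Papers published in venues outside every community are ignored.
    """
    venue_to_comm = {v: c for c, venues in communities.items() for v in venues}
    membership = {}  # author -> set of community names
    edges = set()    # undirected edges stored as sorted pairs
    for venue, authors in papers:
        if venue not in venue_to_comm:
            continue
        for a in authors:
            membership.setdefault(a, set()).add(venue_to_comm[venue])
        for a, b in combinations(sorted(set(authors)), 2):
            edges.add((a, b))
    return membership, edges


def average_overlap(membership):
    """Average number of communities per author (the quantity c)."""
    return sum(len(c) for c in membership.values()) / len(membership)
```

An author who published in both ICML and COLT would thus receive two community labels, which is what produces overlap in the ground truth.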

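$C_1 = \{\text{ICML}\}$, $C_2 = \{\text{ALT}, \text{COLT}\}$; $n = 4374$, $K = 2$, $d_{\text{mean}} = 3.8$

| | c | ĉ | FP | FN | Error |
|---|---|---|---|---|---|
| SC | 1.22 | 1. | 0.38 | 0.39 | 0.39 |
| OCCAM | 1.22 | 1.02 | 0.25 | 0.28 | 0.27 |
| SAAC | 1.22 | 1.04 | 0.26 | 0.28 | 0.27 |

$C_1 = \{\text{NIPS}\}$, $C_2 = \{\text{ICML}\}$, $C_3 = \{\text{ALT}, \text{COLT}\}$; $n = 9272$, $K = 3$, $d_{\text{mean}} = 4.5$

| | c | ĉ | FP | FN | Error |
|---|---|---|---|---|---|
| SC | 1.09 | 1. | 0.39 | 0.55 | 0.46 |
| OCCAM | 1.09 | 1.00 | 0.2 | 0.34 | 0.26 |
| SAAC | 1.09 | 1.03 | 0.21 | 0.31 | 0.25 |

Table 2: Spectral algorithms recovering overlapping machine learning conferences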


7 Conclusion 

Most existing algorithms for community detection assume non-overlapping communities. Although they may in principle be used to detect all subcommunities generated by the various overlaps, this is not sufficient to recover the initial communities due to the combinatorial complexity of the corresponding mapping. We have proposed a spectral algorithm, SAAC, that works directly on the overlapping communities, using the specific geometry of the eigenvectors of the adjacency matrix under the SBMO. We have proved the consistency of this algorithm under the SBMO, provided each community has some positive fraction of pure nodes and the expected node degree is at least logarithmic, and tested its performance on both simulated and real data. This work has raised many interesting issues. First, it would be worth relaxing the assumption that each community has some positive fraction of pure nodes. Next, the experiments on simulated data have shown threshold phenomena in the very sparse regime that should be further explored (see Appendix E). Finally, the proof of consistency actually assumes that the underlying (NP-hard) optimization problem is solved exactly while this is not feasible in practice and heuristics need to be applied, like the proposed alternate optimization procedure. Understanding the impact of these heuristics on the performance of the algorithm is an interesting direction for future work.

A Properties of the SBMO

A.1 Identifiability: proof of Theorem 3

First note that $A = ZBZ^T$ implies $\mathrm{rank}(A) \le \mathrm{rank}(B)$. Now condition (SBMO2) means that the restriction of $Z$ to its $K$ first rows is equal to $I_K$, up to some reordering of the nodes. This gives $\mathrm{rank}(A) \ge \mathrm{rank}(B)$, and thus $\mathrm{rank}(A) = \mathrm{rank}(B)$. If $B$ satisfies (SBMO1), then $\mathrm{rank}(A) = \mathrm{rank}(B) = K$: the parameter $K$ is identifiable.

Now let $Z, Z' \in \mathcal{Z}$ and $B, B'$ invertible matrices such that $A = ZBZ^T = Z'B'Z'^T$. We show that there exists some permutation $\sigma \in \mathfrak{S}_K$ such that $Z = Z'P_\sigma$ and $B = P_{\sigma^{-1}} B' P_{\sigma^{-1}}^T$.

Let $U$ be a matrix containing $K$ independent normalized eigenvectors of $A$ associated to non-zero eigenvalues. The columns of $U$ form a basis of $\mathrm{Im}(A)$. As $\mathrm{Im}(A) \subset \mathrm{Im}(Z)$ and $\mathrm{Im}(A) \subset \mathrm{Im}(Z')$, there exist invertible matrices $X, X'$ such that $U = ZX = Z'X'$. As for all $k = 1,\dots,K$ there exists some $i$ such that $Z_i = 1_{\{k\}}$, the $k$-th row of $X$ is a sum of rows in $X'$, namely

\[
X_k = \sum_{l \in S_k} X'_l
\]

where $S_k \subset \{1,\dots,K\}$. Similarly, each row of $X'$ is a sum of rows in $X$. In particular, for any $k \ne l$, there exist $K$ integers $a_1,\dots,a_K$ such that:

\[
X_k + X_l = \sum_{m=1}^K a_m X_m.
\]

If $S_k \cap S_l \ne \emptyset$, there exists some $m$ such that $a_m \ge 2$. But this is in contradiction with the fact that $X$ is invertible. Hence, $S_k \cap S_l = \emptyset$ for all $k \ne l$. The only way for the $S_k$ to be pairwise disjoint is that there exists a permutation $\sigma$ such that $X' = P_\sigma X$. Since $ZX = Z'X'$ and $X$ is invertible, this implies $Z = Z'P_\sigma$. We deduce that $ZBZ^T = ZP_{\sigma^{-1}} B' P_{\sigma^{-1}}^T Z^T$ and $B = P_{\sigma^{-1}} B' P_{\sigma^{-1}}^T$, by the injectivity of $Z$.


A.2 Identifiability for SBM: proof of Proposition 2 

We simply prove that two nodes $i, j$ are in the same community if and only if $A_i = A_j$. This implies the identifiability of the model: it is indeed sufficient to group nodes whose rows in $A$ are identical. Let $i, j$ be such that $A_i = A_j$. If $Z_i \ne Z_j$ then $BZ_i^T \ne BZ_j^T$ by assumption (SBM1) and $A_i = ZBZ_i^T \ne ZBZ_j^T = A_j$ by assumption (SBM2), a contradiction. Conversely, $Z_i = Z_j$ clearly implies $A_i = A_j$.


A.3 Spectrum of the adjacency matrix 

Proof of Proposition 5. As any eigenvector of $A$ associated to a non-zero eigenvalue belongs to $\mathrm{Im}(A) \subset \mathrm{Im}(Z)$, if $u$ is an eigenvector of $A$ associated to $\alpha_n\mu \ne 0$, there exists $x \in \mathbb{R}^K$ such that $u = Zx$. The following statements are equivalent:

\[
\begin{aligned}
A(Zx) &= \alpha_n \mu (Zx) \\
\frac{\alpha_n}{n} ZBZ^TZx &= \alpha_n \mu Zx \\
ZBOx &= \mu Zx \\
BOx &= \mu x \\
BO^{1/2}(O^{1/2}x) &= \mu O^{-1/2}(O^{1/2}x) \\
O^{1/2}BO^{1/2}(O^{1/2}x) &= \mu (O^{1/2}x)
\end{aligned}
\]

Hence $Zx$ is an eigenvector of $A$ associated to $\alpha_n\mu$ if and only if $O^{1/2}x$ is an eigenvector of $M_0 = O^{1/2}BO^{1/2}$ associated to $\mu$, which concludes the proof.
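The correspondence between the two spectra can be checked numerically on a small example. The sketch below assumes, as the chain of equivalences suggests, that $O = Z^TZ/n$ and $A = (\alpha_n/n) ZBZ^T$; the function name `proposition5_spectra` and the toy matrices are hypothetical.

```python
import numpy as np


def proposition5_spectra(Z, B, alpha_n):
    """Return (non-zero eigenvalues of A, alpha_n * eigenvalues of M0), both sorted.

    Assumes O = Z^T Z / n is positive definite and A = (alpha_n / n) Z B Z^T.
    """
    n = Z.shape[0]
    O = Z.T @ Z / n
    A = (alpha_n / n) * Z @ B @ Z.T
    w, Q = np.linalg.eigh(O)
    O_half = Q @ np.diag(np.sqrt(w)) @ Q.T  # symmetric square root of O
    M0 = O_half @ B @ O_half
    lam = np.linalg.eigvalsh(A)
    nonzero = np.sort(lam[np.abs(lam) > 1e-9])  # drop the numerically zero eigenvalues
    return nonzero, np.sort(alpha_n * np.linalg.eigvalsh(M0))
```

On any full-rank example the two returned arrays should agree up to floating-point error.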

B Key ingredients in the proof of Theorem 8 

B.1 Proof of Lemma 9: Decomposition.

$\sqrt{n}U$ contains independent eigenvectors of $A$ associated to non-zero eigenvalues. From the first statement in Proposition 5, there exists a matrix $V$ of eigenvectors of $M_0$ such that $\sqrt{n}U = ZO^{-1/2}V$. As $U$ contains normalized eigenvectors, $U^TU = I_K$, which yields $V^TV = I_K$ and $V \in \mathcal{O}_K(\mathbb{R})$.

B.2 Proof of Lemma 10: Sensitivity to noise 

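Recall that from Lemma 9, there exists $V \in \mathcal{O}_K(\mathbb{R})$ such that the matrix of leading eigenvectors $U$ can be written

\[
U = ZX \quad \text{with} \quad X = \frac{1}{\sqrt{n}} O^{-1/2} V.
\]

Using that $\|zX\| = \|zO^{-1/2}\|/\sqrt{n}$, the following inequality is a consequence of the definition of $d_0$ (Definition 7):

\[
\forall z \in \{-1,0,1,2\}^{1 \times K} \setminus \{0\}, \quad \|zX\| \ge \frac{d_0}{\sqrt{n}}. \tag{9}
\]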

Let ii,... ,iK (resp. ji,..., Jk) be pure nodes in M relatively to Z (resp. Z') that belong to com¬ 
munities 1,... K: Zii_, = Ij;.} (resp. We first prove that fi,..., fx are also pure nodes 

relatively to Z'. For any k = 1,... ,K, Z[^ can be written as a sum of pure nodes relatively to Z\ there 
exists a set c {1,..., n} such that 

= E 

miSk 

Let A; ^ As and ii belong to M, 

\\{Zl + Z')X' - (Z,, + ZJXII < wzl^x' - Z,,X|| + HZ'X' - z,, All < ^ 
and 

\\(zi^z’,)x’-( E E Z,„)X\\ = IK j: z',.+ j: z'„)x'-( E z,„h- E z,jxii 

^ S V„X’-Z,„X\\^ Y, l|Z'„^'-Z,„X|| 

m^Sk m^Si 


2x/n 

This proves that 

IK E E Z,jX-{Z,^ + Z,^)X\\<dolV^. 

m^Sk m^Si 

If $S_k \cap S_l \ne \emptyset$, there exists $z \in \{0,1,2,-1\}^{1 \times K} \setminus \{0\}$ such that $\|zX\| < d_0/\sqrt{n}$, which contradicts (9). Thus $S_k \cap S_l = \emptyset$. Hence, the supports of the $Z'_{i_k}$ are all disjoint, thus they must be distinct pure nodes. There exists a permutation $\sigma \in \mathfrak{S}_K$ such that

\[
\forall k = 1,\dots,K, \quad Z_{i_k} = 1_{\{k\}} \ \text{ and } \ Z'_{i_k} = 1_{\{\sigma(k)\}}.
\]

To conclude the proof, we show that for $\sigma$ the permutation defined above, it holds that

\[
\forall i \in \mathcal{N}, \ \forall k \in \{1,\dots,K\}, \quad Z'_{i,\sigma(k)} = Z_{i,k}.
\]

Let $i \in \mathcal{N}$. There exists a set $S \subset \{1,\dots,K\}$ such that $Z_i = \sum_{k \in S} 1_{\{k\}}$. It is sufficient to prove that $Z'_i = \sum_{k \in S} 1_{\{\sigma(k)\}}$. To do so, we first introduce $\mathcal{C} = \{0,1\}^{1 \times K} \setminus \{0\}$ and the following important mapping:


^>: C 

z 


C 


^ y : zX' e TZy, 

where is partitioned into the following 2^-1 regions indexed by ?/ e C, 

ny = {x€ : ||x - yX\\ < ||x - y'X \\for all y' €C,y' ^ z}. 

The following lemma gathers useful properties of the mapping <1>. Its proof is given below. 
Lemma 11. $ is a one-to-one mapping satisfying \ \zX' - yX\\ < => ^(z) = y. 

As f e M, from assumption L, 


\z’x’ -ZiXW < 


dp 

2^/n 


Moreover, using that l{fc} = Hk^s and = Hk^s 


(z 


^<7{k) 


X' - ZiX 


(Ezi)x'-(Ez^^x 

\k^S / \fce5 / 


Z\Kx'-z,,x\\< 


kiS 


dp 

2yn 


Using Lemma 11, the last two inequalities yield <h(Z') = Zi and ‘h lo-(fc)) = Zi respectively. 

Using that <1> is one-to-one (again from Lemma 1 1) concludes the proof: 

z'i = ZUk)- 

kiS 


Proof of Lemma 11. Let z,y €Cbe such that \\zX' - yX\\ < Let y' € C : y' y. Using (9), 

||zA' - y'X\\ > ||zX - y'X\\- ||zX' - yM\\ > A _ ^ > ^ > ||^x' - yX\\. 

yjn 2\/n 2\/n 

Hence, zX' e TZy and <h(z) = y, which proves the second part of the result. 

We now prove that <1> is one-to-one. Let y e C: there exists a set 5 £ {1,..., iT} such that y = 
TmeS l{fc} = Em £5 Zi^ .Letz= ■ As the Z'^ are disjoint indicators, one has z€C. Moreover, 

\\,X'-yX\\<'Z\\z;^X'-Z,„.X\\<X 

meS 

From what we have just proved, this implies $\Phi(z) = y$. As $\mathcal{C}$ is finite and $\forall y \in \mathcal{C}, \exists z \in \mathcal{C} : \Phi(z) = y$, $\Phi$ is one-to-one.

C Results for non-adaptive procedures

We present here tighter upper bounds on the fraction of nodes that are misclassified by some non-adaptive estimation procedures, based on $\hat{U} \in \mathbb{R}^{n \times K}$ rather than on $\hat{U} \in \mathbb{R}^{n \times \hat{K}}$ (with $\hat{K}$ given in Theorem 8). In this case, it is possible to analyze the solution of $(\mathcal{P}_\epsilon)$, defined in Section 5.2, as well as the solution of the following optimization problem:

{Vr)- min \\Z'X'-U\\l. 


$(\mathcal{P}_T)$ relies on the knowledge of $T$, the set of subcommunities that are present in the network. If one has this knowledge, note that the above estimate can be computed using alternate minimization, just like the solution of $(\mathcal{P}')$. Theorem 12 below gathers the theoretical guarantees obtained for these two estimators. Compared to Theorem 8, a logarithmic factor is removed in the upper bound on the number of misclassified nodes: both estimates are consistent provided that


an > max 


Lr. 


■log(n);% 

/^o 


Theorem 12. Let U be a matrix formed by K independent eigenvectors associated to the eigenvalues 
of A that are largest in absolute value. Let (Z, C) be the solution of {Vf) or of (Vp). There exists a 
constant 6*2 > 0 such that for all r > 0 , there exists a constant Cr such that if 


then, for n large enough, with probability larger than 1 - n 

MisC(Z,Z) ^ ^^ C^K^Lrn.. 1 
n ” dl^l an 

The proof of Theorem 12 is very similar to that of Theorem 8 given in the previous section. The main difference is that in the non-adaptive case it is possible to use a tighter eigenvector perturbation result (specific to SBMO), that we state below as Lemma 13. Compared to Lemma 6, in Lemma 13 an extra logarithmic factor is removed, but at the price of non-explicit constants, that do not permit to propose an adaptive version of the result. The proofs of both Lemma 6 and Lemma 13 are given in the next section.

Lemma 13. Let $\hat{A}$ be drawn under a SBMO model with expected adjacency matrix $A$. Let $K$ be the rank of $A$. Let $\hat{U}$ (resp. $U$) be a matrix whose columns are $K$ independent eigenvectors associated to the $K$ eigenvalues of $\hat{A}$ (resp. $A$) with largest absolute values.

For all r > 0, there exists a constant Cr such that under the conditions 


an > max 


1 w ^ 

7-log(n);^ 

-^max To. 


'^min(-^) /^max ^ Cp Clfld dmax ^ log (tz) , 

with probability larger than 1 - there exists a matrix P € such that 

Also, compared to that of $(\mathcal{P}_\epsilon)$, the analysis of the solution of $(\mathcal{P}_T)$ requires a more complex argument to prove that the set $\mathcal{N}_n$ and $(\hat{Z}, \hat{X}_1)$ defined in the proof of Theorem 8 satisfy assumption 2. of Lemma 10, i.e. that $\mathcal{N}_n$ contains one pure node per community in $Z$ and $\hat{Z}$. We present below the argument that can be used in that case.

$\mathcal{N}_n$ contains pure nodes. Under the assumption

-UPfp < min 
Aq zeT 

|A/’^|/n < j3z for each possible membership vector z ^T. Thus, for all z ^T, the set of nodes i such that 
Zi = z cannot be included in and there exists e Mn such that = z. In particular, Mn contains 
pure nodes relatively to Z. Now we need to prove that it also contains pure nodes relatively to Z. 

To do so, we introduce the following mapping and prove it is one-to-one: 

T T 

z I —> y- zXi e TZy, 

where is partitioned into K = \T\ regions, indexed by y e T, 

'JZy = {x€R^ ||x - yX\\ < ||x - y'X\\ for all yMT-y'* y). 

For all i e Mn, ^(Zj) = Zj. Indeed, ZjXi e IZzi for if 2 /^ ^ is such that y't Zi, using (9) and the fact 
that i belongs to Mn yields 

llZiXi - y'X\\ > \\ZiX - y'X\\- ||ZiXi - ZiXjj > A _ ^ > ^ > \\z,Xi - ZiX\\. 

It follows that for all y ^ T, there exists z € T such that y = ^'(z). Indeed, there exists iy e Mn such 
that Ziy = y, thus ^(Zi^) = y and z = Zi^ belongs to T by definition of the optimization problem that Z 
solves. As T is a finite set, 'F is one-to-one. Thus, one has 

{Zi^ : z € r} = {{Zi^ ■.z^T}) = 'F-' (T) = r. 

In particular, there exists ii,... ,iK (resp. ji,..., j_fr) such that VA:e {!,...,iT}, Zj^, = Zj^, = 

D Proof of the eigenvectors perturbation results 

Lemma 6 and Lemma 13 rely on two main ingredients, described below: a high-probability bound on the spectral norm of $\hat{A} - A$ (i.e. a concentration result), and results from linear algebra, mostly the Davis-Kahan theorem.

D.1 Main ingredients

We state here the matrix concentration result that is used to prove Lemma 6, which is of interest in its own right. This result is not specific to the DC-SBM, it holds for any random graph model. It follows from a Bernstein inequality for sums of independent matrices, and its proof is given in Section D.3.

Theorem 14. Let $\delta \in ]0,1[$. Let $\epsilon > 0$ be fixed. If

\[
d_{\max} \ge \frac{2}{9} \, \frac{1+\epsilon}{\epsilon^2} \log\left(\frac{2n}{\delta}\right),
\]

one has

\[
\mathbb{P}\left( \|\hat{A} - A\| > \sqrt{2(1+\epsilon) d_{\max} \log\left(\frac{2n}{\delta}\right)} \right) \le \delta.
\]

Another concentration result, given below, is used to prove Lemma 13. This result, recently obtained by [Lei and Rinaldo, 2015], improves the dependency in $n$ in the high-probability upper bound on $\|\hat{A} - A\|$, since a logarithmic term is removed compared to Theorem 14. However, the constants in the upper bound are non-explicit.

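Theorem 15. [Theorem 5.2 of [Lei and Rinaldo, 2015]] In a random graph model, if $d$ is such that $d \ge n \max_{i,j} A_{i,j}$ and $d \ge c_0 \log(n)$, for every $r > 0$ there exists a constant $C = C(r, c_0)$ such that

\[
\mathbb{P}\left( \|\hat{A} - A\| > C\sqrt{d} \right) \le n^{-r}.
\]

In the proof of Lemma 6, another concentration result is needed to control the deviations of the empirical degrees from the mean degrees. The following result follows from Bernstein inequalities (for independent random variables) and is proved in Appendix D.4.

Lemma 16. Let $a \in ]0,1[$.

\[
\begin{aligned}
\mathbb{P}\left( \hat{d}_{\max} \le (1+a) d_{\max} \right) &\ge 1 - n \exp\left( -\frac{d_{\max} a^2}{2(1+a/3)} \right), \\
\mathbb{P}\left( \hat{d}_{\max} \ge (1-a) d_{\max} \right) &\ge 1 - \exp\left( -\frac{d_{\max} a^2}{2(1+a/3)} \right).
\end{aligned}
\]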

We state here two useful results from linear algebra, that relate the eigenvalues and eigenvectors of two matrices $A$ and $B$ to the difference in spectral norm between the two matrices.

Lemma 17 (Weyl's inequalities). Let $\lambda_1(A) \ge \dots \ge \lambda_n(A)$ denote the ordered eigenvalues of a symmetric matrix $A$ of size $n$. For any two symmetric matrices $A$ and $B$ of size $n$,

\[
\text{for all } i = 1,\dots,n, \quad |\lambda_i(A) - \lambda_i(B)| \le \|A - B\|.
\]

Theorem 18 (Davis-Kahan theorem). Let $A$ and $B$ be two symmetric matrices of size $n$. Let $I \subset \mathbb{R}$ be an interval that contains exactly $k$ eigenvalues of $A$ and $B$. Let $X_A$ (resp. $X_B$) be a matrix in $\mathbb{R}^{n \times k}$ whose columns are $k$ independent normalized eigenvectors associated to the eigenvalues of $A$ (resp. $B$) in $I$. Then there exists a rotation $P \in \mathcal{O}_k(\mathbb{R})$ such that, with $\delta := \inf\{|\lambda - s|, \ \lambda \in \mathrm{sp}(B), \lambda \notin I, \ s \in I\}$,

\[
\|X_A - X_B P\|_F \le \frac{\sqrt{2}}{\delta} \|A - B\|.
\]

The usual statement of the Davis-Kahan theorem involves principal angles between the column spaces of $X_A$ and $X_B$, but the above formulation can be easily obtained from Proposition B.1 of [Rohe et al., 2011] and the following explanation therein relating the principal angles to the Frobenius norm of $X_A - X_B P$ for some rotation $P$.
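Weyl's inequalities (Lemma 17) are easy to check numerically. The following minimal sketch, whose function name `weyl_check` is hypothetical, compares the largest gap between the ordered eigenvalues of two symmetric matrices with the spectral norm of their difference.

```python
import numpy as np


def weyl_check(A, B):
    """Return (max_i |lambda_i(A) - lambda_i(B)|, spectral norm of A - B)."""
    lam_A = np.sort(np.linalg.eigvalsh(A))[::-1]  # eigenvalues in non-increasing order
    lam_B = np.sort(np.linalg.eigvalsh(B))[::-1]
    gap = float(np.max(np.abs(lam_A - lam_B)))
    return gap, float(np.linalg.norm(A - B, 2))
```

For any pair of symmetric inputs the first returned value should not exceed the second, up to floating-point error.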

D.2 Proof of Lemma 6 and Lemma 13 

Let $\lambda_k(A)$ be the eigenvalues of $A$, and $\lambda_k(\hat{A})$ be the eigenvalues of $\hat{A}$, sorted in non-increasing order. Let $s$ (resp. $r$) be the number of eigenvalues of $A$ that are strictly positive (resp. negative), so that $K = s + r$ (where $K$ is the rank of $A$). Using Weyl's inequalities (Lemma 17), one can write

\[
\begin{aligned}
\text{for } k = 1,\dots,s, \quad & \lambda_k(\hat{A}) \ge \lambda_k(A) - \|\hat{A} - A\|, \\
\text{for } k = s+1,\dots,n-r, \quad & |\lambda_k(\hat{A})| \le \|\hat{A} - A\|, \\
\text{for } k = n-r+1,\dots,n, \quad & \lambda_k(\hat{A}) \le \lambda_k(A) + \|\hat{A} - A\|.
\end{aligned}
\]

Let $\hat{U}_K$ (resp. $U_K$) be a matrix whose columns are $K$ orthogonal eigenvectors associated to the $K$ largest eigenvalues (in absolute value) of matrix $\hat{A}$ (resp. $A$). The proofs of the two lemmas rely on the following important statement:

\[
\lambda_{\min}(A) > 2\|\hat{A} - A\| \ \Rightarrow \ \exists \hat{P} \in \mathcal{O}_K(\mathbb{R}) : \ \|\hat{U}_K - U_K\hat{P}\|_F^2 \le \frac{16}{\lambda_{\min}(A)^2} \|\hat{A} - A\|^2. \tag{10}
\]

We prove (10). If $\lambda_{\min}(A) > 2\|\hat{A} - A\|$, letting $a_n = \lambda_{\min}(A)/2$, one has

\[
\begin{aligned}
k \in \{1,\dots,s\} \ &\Rightarrow \ \lambda_k(\hat{A}) \in [a_n, +\infty[ \\
k \in \{s+1,\dots,n-r\} \ &\Rightarrow \ \lambda_k(\hat{A}) \in ]-a_n, a_n[ \\
k \in \{n-r+1,\dots,n\} \ &\Rightarrow \ \lambda_k(\hat{A}) \in ]-\infty, -a_n]
\end{aligned}
\]

In particular, the $K$ eigenvalues of $\hat{A}$ with largest absolute values are $\lambda_k(\hat{A})$, for $k \in \{1,\dots,s\}$ (positive eigenvalues) and $k \in \{n-r+1,\dots,n\}$ (negative eigenvalues). The matrix $\hat{U}_K$ can thus be written (up to a permutation of the columns) $\hat{U}_K = [\hat{U}^+ | \hat{U}^-]$, where the $s$ columns of $\hat{U}^+$ are normalized eigenvectors associated to positive eigenvalues and the $r$ columns of $\hat{U}^-$ are normalized eigenvectors associated to negative eigenvalues. Let $U_K = [U^+ | U^-]$ be a matrix of normalized eigenvectors of $A$ decomposed similarly (up to the same permutation of columns).

As $I^+ := [a_n, +\infty[$ contains exactly $s$ eigenvalues of $A$ and $\hat{A}$, from Theorem 18, there exists a matrix $P^+ \in \mathcal{O}_s(\mathbb{R})$ such that

\[
\|\hat{U}^+ - U^+P^+\|_F \le \frac{2\sqrt{2}}{\lambda_{\min}(A)} \|\hat{A} - A\|.
\]

Similarly, as $I^- := ]-\infty, -a_n]$ contains exactly $r$ eigenvalues of $A$ and $\hat{A}$, from Theorem 18, there exists a matrix $P^- \in \mathcal{O}_r(\mathbb{R})$ such that

\[
\|\hat{U}^- - U^-P^-\|_F \le \frac{2\sqrt{2}}{\lambda_{\min}(A)} \|\hat{A} - A\|.
\]

Let $\hat{P}$ be the block diagonal matrix of size $r + s = K$ with $P^+$ and $P^-$ as first and second block. One has

\[
\|\hat{U}_K - U_K\hat{P}\|_F^2 = \|\hat{U}^+ - U^+P^+\|_F^2 + \|\hat{U}^- - U^-P^-\|_F^2 \le \frac{16}{\lambda_{\min}(A)^2} \|\hat{A} - A\|^2.
\]

This proves (10).


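Proof of Lemma 6. Let $\eta$ be fixed and let $\epsilon = \eta/(2+\eta)$, so that $(1+\epsilon)/(1-\epsilon) = 1+\eta$. Let $\mathcal{E}, \mathcal{F}, \mathcal{G}$ be the three events

\[
\begin{aligned}
\mathcal{E} &= \left( \|\hat{A} - A\| \le \sqrt{2(1+\epsilon) d_{\max} \log(4n/\delta)} \right) \\
\mathcal{F} &= \left( \hat{d}_{\max}(n) \le (1+\epsilon) d_{\max}(n) \right) \\
\mathcal{G} &= \left( \hat{d}_{\max}(n) \ge (1-\epsilon) d_{\max}(n) \right)
\end{aligned}
\]

and $\mathcal{H} = \mathcal{E} \cap \mathcal{F} \cap \mathcal{G}$. We first show that $\mathbb{P}(\mathcal{H}) \ge 1 - \delta$ under the assumption

\[
d_{\max} \ge \frac{2(1+\epsilon/3)}{\epsilon^2} \log\left(\frac{4n}{\delta}\right). \tag{11}
\]

From Theorem 14, this condition implies $\mathbb{P}(\mathcal{E}^c) \le \delta/2$. From Lemma 16, one has

\[
\begin{aligned}
\mathbb{P}(\mathcal{F}^c) &\le n e^{-d_{\max}\frac{\epsilon^2}{2(1+\epsilon/3)}} \le \delta/4, \\
\mathbb{P}(\mathcal{G}^c) &\le e^{-d_{\max}\frac{\epsilon^2}{2(1+\epsilon/3)}} \le n e^{-d_{\max}\frac{\epsilon^2}{2(1+\epsilon/3)}} \le \delta/4.
\end{aligned}
\]

A union bound then yields $\mathbb{P}(\mathcal{H}) \ge 1 - \delta$.

From now on, we assume that the event $\mathcal{H}$ holds. We first prove that under the extra assumption

\[
\lambda_{\min}(A) \ge C_\epsilon \sqrt{d_{\max} \log(4n/\delta)} \quad \text{with} \quad C_\epsilon = \sqrt{2(1+\epsilon)}\left(1 + \sqrt{1+\eta}\right), \tag{12}
\]

the set

\[
\hat{S}_n = \left\{ k : |\lambda_k(\hat{A})| \ge \sqrt{2(1+\eta)\hat{d}_{\max}\log(4n/\delta)} \right\}
\]

coincides with $\{1,\dots,s\} \cup \{n-r+1,\dots,n\}$, thus its cardinality is $K$. The consequences of Weyl's inequality yield in this particular case, using that $\mathcal{E}$ holds,

\[
\begin{aligned}
\text{for } k \notin \{s+1,\dots,n-r\}, \quad & |\lambda_k(\hat{A})| \ge \lambda_{\min}(A) - \sqrt{2(1+\epsilon) d_{\max}(n)\log(4n/\delta)} \qquad (13) \\
\text{for } k \in \{s+1,\dots,n-r\}, \quad & |\lambda_k(\hat{A})| \le \sqrt{2(1+\epsilon) d_{\max}\log(4n/\delta)}. \qquad (14)
\end{aligned}
\]

For every $k \in \hat{S}_n$, using that $\mathcal{G}$ holds and that $(1+\eta)(1-\epsilon) = 1+\epsilon$, one has

\[
|\lambda_k(\hat{A})| \ge \sqrt{2(1+\eta)\hat{d}_{\max}\log(4n/\delta)} \ge \sqrt{2(1+\epsilon) d_{\max}(n)\log(4n/\delta)}.
\]

From (14), this proves that $k \in \{1,\dots,n\} \setminus \{s+1,\dots,n-r\}$. Conversely, let $k \in \{1,\dots,n\} \setminus \{s+1,\dots,n-r\}$. Using (13),

\[
\begin{aligned}
|\lambda_k(\hat{A})| &\ge C_\epsilon\sqrt{d_{\max}\log(4n/\delta)} - \sqrt{2(1+\epsilon) d_{\max}\log(4n/\delta)} \\
&\ge \left(C_\epsilon - \sqrt{2(1+\epsilon)}\right)\sqrt{(\hat{d}_{\max}/(1+\epsilon))\log(4n/\delta)} = \frac{C_\epsilon - \sqrt{2(1+\epsilon)}}{\sqrt{1+\epsilon}} \sqrt{\hat{d}_{\max}\log(4n/\delta)} \\
&\ge \sqrt{2(1+\eta)\hat{d}_{\max}\log(4n/\delta)},
\end{aligned}
\]

where we use that $\mathcal{F}$ holds for the second inequality. Hence $k \in \hat{S}_n$. Thus $\hat{S}_n = \{1,\dots,n\} \setminus \{s+1,\dots,n-r\}$.

As the set $\hat{S}_n$ is of cardinality $K$, the matrix $\hat{U}$ in the statement of Lemma 6 coincides with $\hat{U}_K$, the matrix formed by the $K$ leading eigenvectors of $\hat{A}$. Assumption (12) implies in particular (using additionally that $\mathcal{E}$ holds) that

\[
\lambda_{\min}(A) > 2\sqrt{2(1+\epsilon) d_{\max}\log(4n/\delta)} \ge 2\|\hat{A} - A\|.
\]

From (10), one obtains

\[
\|\hat{U} - U\hat{P}\|_F^2 = \|\hat{U}_K - U_K\hat{P}\|_F^2 \le \frac{16}{\lambda_{\min}(A)^2}\|\hat{A} - A\|^2 \le \frac{32(1+\epsilon) d_{\max}}{\lambda_{\min}(A)^2} \log\left(\frac{4n}{\delta}\right).
\]

The result follows by expressing $\epsilon$ in terms of $\eta$ in this last equation and in assumptions (11) and (12).

Proof of Lemma 13. In the SBMO model, there exists a constant $c$ such that

\[
d_{\max} \le n \max_{i,j} A_{i,j} \le c \, d_{\max}.
\]

Let $r > 0$. From Theorem 15, there exists a constant $C_r$ such that if $d_{\max} \ge \log(n)$, with probability larger than $1 - n^{-r}$,

\[
\|\hat{A} - A\| \le C_r \sqrt{n \max_{i,j} A_{i,j}} \le (\sqrt{c}\, C_r)\sqrt{d_{\max}}.
\]

Letting $C'_r = \sqrt{c}\, C_r$, under the assumption that $\lambda_{\min}(A) > 2C'_r\sqrt{d_{\max}}$, one has $\lambda_{\min}(A) > 2\|\hat{A} - A\|$. From (10), this yields

\[
\|\hat{U} - U\hat{P}\|_F^2 \le \frac{16}{\lambda_{\min}(A)^2}\|\hat{A} - A\|^2 \le \frac{16\, C_r'^2 \, d_{\max}}{\lambda_{\min}(A)^2}.
\]

Lemma 13 follows by rescaling the constant $C_r$.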




D.3 Proof of Theorem 14: a matrix concentration result 

Our proof is based on the following result by [Tropp, 2012]. 

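Lemma 19 (Theorem 1.4, [Tropp, 2012]). Let $(X_k)$ be a sequence of independent, random, symmetric matrices with dimension $d$. Assume that each random matrix satisfies

\[
\mathbb{E}[X_k] = 0 \quad \text{and} \quad \lambda_{\max}(X_k) \le R \ \text{almost surely}
\]

and let $\sigma^2$ be such that $\left\| \sum_k \mathbb{E}[X_k^2] \right\| \le \sigma^2$. Then, for all $t \ge 0$,

\[
\mathbb{P}\left( \lambda_{\max}\left( \sum_k X_k \right) \ge t \right) \le d \exp\left( - \frac{t^2/2}{\sigma^2 + Rt/3} \right).
\]

One has

\[
\hat{A} - A = \sum_{i \le j} X_{i,j},
\]

where $X_{i,j}$ is a matrix of size $n$ defined, with $(e_i)$ the canonical basis of $\mathbb{R}^n$, by

\[
X_{i,j} := (\hat{A}_{i,j} - A_{i,j}) \times
\begin{cases}
e_i e_j^T + e_j e_i^T & \text{if } i < j \\
e_i e_i^T & \text{if } i = j.
\end{cases}
\]

One has $\|X_{i,j}\| \le |\hat{A}_{i,j} - A_{i,j}| \le 1$ and

\[
\left\| \sum_{i \le j} \mathbb{E}[X_{i,j}^2] \right\| = \left\| \mathrm{Diag}\left( \sum_{j=1}^n A_{i,j}(1 - A_{i,j}) \right) \right\| = \max_i \sum_{j=1}^n A_{i,j}(1 - A_{i,j}) \le \max_i \sum_{j=1}^n A_{i,j} \le d_{\max}.
\]

From Lemma 19,

\[
\mathbb{P}\left( \|\hat{A} - A\| > a\, d_{\max} \right) \le 2n \exp\left( - d_{\max} \frac{a^2}{2(1+a/3)} \right).
\]

Let $\epsilon > 0$. Choosing $a = \sqrt{2(1+\epsilon)\log(2n/\delta)/d_{\max}}$, for

\[
d_{\max} \ge \frac{2}{9} \, \frac{1+\epsilon}{\epsilon^2} \log\frac{2n}{\delta}
\]

(which is equivalent to $a/3 \le \epsilon$), one has

\[
\mathbb{P}\left( \|\hat{A} - A\| > \sqrt{2(1+\epsilon) d_{\max} \log\left(\frac{2n}{\delta}\right)} \right) \le 2n \exp\left( - d_{\max} \frac{a^2}{2(1+a/3)} \right) \le \delta.
\]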
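The bound of Theorem 14 can be compared with a simulated deviation. The sketch below is only an illustration under assumptions: the function names `sample_adjacency` and `theorem14_bound` are hypothetical, edges are drawn independently for $i < j$ with no self-loops, and the expected adjacency matrix is supplied by the caller.

```python
import numpy as np


def sample_adjacency(A, rng):
    """Draw a symmetric 0/1 matrix with independent upper-triangular entries of mean A[i, j]."""
    n = A.shape[0]
    upper = np.triu(rng.random((n, n)) < A, k=1)
    return (upper | upper.T).astype(float)


def theorem14_bound(d_max, n, delta, eps):
    """Right-hand side sqrt(2 (1 + eps) d_max log(2n / delta)) of Theorem 14."""
    return np.sqrt(2 * (1 + eps) * d_max * np.log(2 * n / delta))
```

With $n = 300$ and edge probability $0.1$ off the diagonal, the simulated spectral norm of $\hat{A} - A$ typically sits well below the bound, which reflects the extra logarithmic factor discussed in the text.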

D.4 Proof of Lemma 16: a deviation result for the empirical degrees 

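For all $i \in \{1,\dots,n\}$,

\[
\hat{d}_i - d_i = \sum_{j=1}^n (\hat{A}_{i,j} - A_{i,j}).
\]

As $\mathbb{E}[\hat{A}_{i,j}] = A_{i,j}$, $|\hat{A}_{i,j}| \le 1$ and $\sum_{j=1}^n \mathbb{E}[\hat{A}_{i,j}^2] = \sum_{j=1}^n A_{i,j} = d_i$, Bennett's inequality ([Boucheron et al., 2013]) yields, for all $t > 0$,

\[
\begin{aligned}
\mathbb{P}(\hat{d}_i - d_i > t) &\le \exp\left(-d_i h(t/d_i)\right), \\
\mathbb{P}(\hat{d}_i - d_i < -t) &\le \exp\left(-d_i h(t/d_i)\right),
\end{aligned}
\]

where $h$ is the function defined by $h(u) = (u+1)\log(u+1) - u$.

The next two inequalities follow from the fact that $v \mapsto v h(t/v)$ is decreasing for all $t$, hence

\[
\begin{aligned}
\mathbb{P}(\hat{d}_i - d_i > a\, d_{\max}) &\le \exp(-d_{\max} h(a)), \qquad (15) \\
\mathbb{P}(\hat{d}_i - d_i < -a\, d_{\max}) &\le \exp(-d_{\max} h(a)). \qquad (16)
\end{aligned}
\]

Let $i_0$ be such that $d_{i_0} = d_{\max}$. From (16),

\[
\begin{aligned}
\mathbb{P}(\hat{d}_{i_0} \ge d_{i_0} - a\, d_{\max}) &\ge 1 - \exp(-d_{\max} h(a)), \\
\mathbb{P}(\hat{d}_{\max} \ge (1-a) d_{\max}) &\ge 1 - \exp(-d_{\max} h(a)),
\end{aligned}
\]

using that in particular $\hat{d}_{\max} \ge \hat{d}_{i_0}$. From (15) and a union bound,

\[
\begin{aligned}
\mathbb{P}(\forall i \in \{1,\dots,n\}, \ \hat{d}_i \le d_i + a\, d_{\max}) &\ge 1 - n\exp(-d_{\max} h(a)), \\
\mathbb{P}(\forall i \in \{1,\dots,n\}, \ \hat{d}_i \le (1+a) d_{\max}) &\ge 1 - n\exp(-d_{\max} h(a)), \\
\mathbb{P}(\hat{d}_{\max} \le (1+a) d_{\max}) &\ge 1 - n\exp(-d_{\max} h(a)),
\end{aligned}
\]

by definition of $d_{\max}$. The statements in Lemma 16 follow from the lower bound

\[
h(u) \ge \frac{u^2}{2(1+u/3)},
\]

that can be found in [Boucheron et al., 2013].

E The sparse case: uncovering a phase transition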


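Consider the following simple SBMO with two communities and a diagonal connectivity matrix such
that, if $\mathbf{1}_\ell$ denotes the vector of length $\ell$ containing only ones, the expected adjacency matrix is

$$A = \frac{\alpha_n}{n} Z B Z^T, \quad \text{with } B = \begin{pmatrix} a & 0 \\ 0 & a \end{pmatrix} \text{ and } Z = \begin{pmatrix} \mathbf{1}_{sn} & 0 \\ \mathbf{1}_{(1-2s)n} & \mathbf{1}_{(1-2s)n} \\ 0 & \mathbf{1}_{sn} \end{pmatrix},$$
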

where 0 < s < 1/2 is the fraction of pure nodes in each of the two communities: the smaller s, the larger 
the overlap, whereas s = 1/2 corresponds to pure nodes only, i.e. a SBM without overlap. We now 
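elaborate on this example. The matrix $A$ has rank 2 with eigenvalues $\alpha_n a (2 - 3s) > \alpha_n s a$ and associated
eigenvectors that are respectively

$$X = \begin{pmatrix} \mathbf{1}_{sn} \\ 2 \cdot \mathbf{1}_{(1-2s)n} \\ \mathbf{1}_{sn} \end{pmatrix} \quad \text{and} \quad Y = \begin{pmatrix} -\mathbf{1}_{sn} \\ 0_{(1-2s)n} \\ \mathbf{1}_{sn} \end{pmatrix}.$$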


Each node $i$ of the network has a spectral embedding given by $(X_i, Y_i)$, i.e. pure nodes in community
one correspond to $P_1 = (1, -1)$, pure nodes in community two correspond to $P_2 = (1, 1)$ and mixed
nodes correspond to $M = (2, 0)$. As expected, we have $M = P_1 + P_2$ and if $\alpha_n$ is sufficiently large, i.e.
$\alpha_n \gg \log n$, then Theorems 8 and 12 apply: the eigenvectors of the empirical adjacency matrix $\hat{A}$ will
be close to the eigenvectors $X$ and $Y$ and as a consequence, the number of nodes that are misclassified
by SAAC will vanish as $n$ tends to infinity.
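The eigenvalues and the embedding points $P_1$, $P_2$ and $M$ of this example can be verified numerically. The sketch below builds $Z$ and $A$ for one illustrative choice of parameters ($n = 1200$, $s = 1/3$, $a = 9$, $\alpha_n = 1$); the function names are hypothetical and $sn$ is assumed to be an integer.

```python
import numpy as np


def membership_matrix(n, s):
    """Z for the two-community example: sn pure nodes of community one,
    (1 - 2s)n mixed nodes, then sn pure nodes of community two."""
    n_pure = int(round(s * n))
    Z = np.zeros((n, 2))
    Z[: n - n_pure, 0] = 1.0  # community one: its pure nodes and the mixed nodes
    Z[n_pure:, 1] = 1.0       # community two: the mixed nodes and its pure nodes
    return Z


def expected_adjacency(Z, a, alpha_n):
    """A = (alpha_n / n) Z B Z^T with B = a * I."""
    n = Z.shape[0]
    B = a * np.eye(2)
    return (alpha_n / n) * Z @ B @ Z.T


n, s, a, alpha_n = 1200, 1 / 3, 9.0, 1.0
Z = membership_matrix(n, s)
A = expected_adjacency(Z, a, alpha_n)

X = Z @ np.array([1.0, 1.0])   # eigenvector for alpha_n * a * (2 - 3s)
Y = Z @ np.array([-1.0, 1.0])  # eigenvector for alpha_n * a * s
embedding = np.column_stack([X, Y])

top_two = np.sort(np.linalg.eigvalsh(A))[-2:][::-1]
print(top_two, alpha_n * a * (2 - 3 * s), alpha_n * a * s)
print(embedding[0], embedding[n // 2], embedding[-1])
```

The three printed rows of the embedding correspond to a pure node of community one, a mixed node and a pure node of community two.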

Let us now consider the very sparse case where $\alpha_n = 1$. In this case our theoretical results are not valid
and indeed we believe that there is a range of parameters where only partial recovery is possible. Note
that if one only has access to the eigenvector $X$ (associated with the largest eigenvalue $a(2 - 3s)$), then it
is possible to distinguish pure nodes from mixed nodes but it is impossible to distinguish pure nodes of
community one from pure nodes of community two. Observe that the second eigenvalue of $A$, $sa$, can be
very small, in which case this eigenvalue will be 'hidden' in the noise of the model. In the sparse regime,
it is known that high-degree nodes induce a lot of noise on the spectrum of the adjacency matrix. We now
give a quantitative conjecture based on recent results obtained on the non-backtracking matrix [Krzakala
et al., 2013, Bordenave et al., 2015, Saade et al., 2015], which can be seen as a way to regularize the
adjacency matrix. We refer to the works cited above for a precise description of the non-backtracking
matrix and its spectral analysis. We should stress that the rigorous results obtained so far for the
non-backtracking matrix do not allow us to cover the present framework. However, it is believed that the
largest eigenvalue of the non-backtracking matrix for our graph will be $a(2 - 3s) + o_n(1)$ and that the
noise, i.e. the eigenvalues $\lambda$ corresponding to eigenvectors not correlated with the communities, will be of
modulus $|\lambda| \leq \sqrt{a(2 - 3s)}$. In particular, if $sa > \sqrt{a(2 - 3s)}$, then a second eigenvalue appears on the
real axis at $sa + o_n(1)$. Moreover, the eigenvectors associated with these two eigenvalues are correlated with
the true communities. To summarize, we claim:

Conjecture 20. If $s^2 a > 2 - 3s$ then a spectral algorithm based on the non-backtracking matrix will
classify a positive fraction of the pure nodes.

If $s^2 a < 2 - 3s$ then it is impossible to classify the pure nodes better than by random guessing.
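For reference, the threshold in Conjecture 20 is the condition $sa > \sqrt{a(2 - 3s)}$ after squaring, which is legitimate because both sides are positive for $a > 0$ and $0 < s \leq 1/2$. The two numerical values in the comments are worked examples added here for illustration.

```latex
\[
  sa > \sqrt{a(2 - 3s)}
  \iff s^2 a^2 > a(2 - 3s)
  \iff s^2 a > 2 - 3s
  \iff a > \frac{2 - 3s}{s^2}.
\]
% Worked values of the critical a = (2 - 3s) / s^2:
%   s = 1/3:  (2 - 1)   / (1/9) = 9
%   s = 1/2:  (2 - 3/2) / (1/4) = 2
```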

The spectral algorithm will be similar to SAAC except that we replace $\hat{U}$, containing the eigenvectors
of $\hat{A}$, by the corresponding matrix computed from the eigenvectors of the empirical non-backtracking
matrix. The rest of the algorithm is unchanged. Figure 3 shows the spectrum of the non-backtracking
matrix for three values of $a$ around the phase transition at $a = (2 - 3s)/s^2$, which in this particular case is 9 as
$s = 1/3$. Note that in this example, it is always possible to distinguish the pure nodes from the mixed
nodes. Only the classification of the pure nodes is non-trivial.



Figure 3: Spectrum of the non-backtracking operator with $n = 1200$, $sn = 400$ and $a = 9, 11, 13$. The
circle has radius $\sqrt{a(2 - 3s)}$ in each case.
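A spectrum of this kind could be computed on a smaller instance along the following lines. This is only a sketch under stated assumptions: the function names are hypothetical, self-loops are discarded, the non-backtracking matrix is built densely on directed edges (which is why $n = 150$ is used rather than $n = 1200$), and the convention chosen for rows and columns may be the transpose of the one used in the cited works, which does not change the spectrum.

```python
import numpy as np


def sample_sbmo_graph(n, s, a, rng):
    """One graph from the two-community example with alpha_n = 1.
    Self-loops are discarded in this sketch; requires 2a <= n."""
    n_pure = int(round(s * n))
    Z = np.zeros((n, 2))
    Z[: n - n_pure, 0] = 1.0
    Z[n_pure:, 1] = 1.0
    P = (a / n) * Z @ Z.T
    upper = np.triu(rng.random((n, n)) < P, 1)
    return (upper | upper.T).astype(int)


def non_backtracking_matrix(adj):
    """Entry ((u -> v), (w -> x)) is 1 when v == w and x != u, else 0."""
    n = adj.shape[0]
    edges = [(u, v) for u in range(n) for v in range(n) if adj[u, v] and u != v]
    index = {edge: k for k, edge in enumerate(edges)}
    B = np.zeros((len(edges), len(edges)))
    for (u, v), k in index.items():
        for x in range(n):
            if adj[v, x] and x != u and x != v:
                B[k, index[(v, x)]] = 1.0
    return B


def spectrum_summary(adj, a, s):
    """Eigenvalues of B, the radius sqrt(a (2 - 3s)), and the eigenvalues outside the circle."""
    eigenvalues = np.linalg.eigvals(non_backtracking_matrix(adj))
    radius = np.sqrt(a * (2 - 3 * s))
    outside = eigenvalues[np.abs(eigenvalues) > radius]
    return eigenvalues, radius, outside


rng = np.random.default_rng(0)
adj = sample_sbmo_graph(n=150, s=1 / 3, a=11.0, rng=rng)
eigenvalues, radius, outside = spectrum_summary(adj, a=11.0, s=1 / 3)
print(len(eigenvalues), radius, np.sort(outside.real)[::-1][:4])
```

On such a small graph, finite-size effects are strong, so the number of eigenvalues found outside the circle should not be read as evidence for or against the conjecture.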

Figure 4 illustrates the behavior, in the sparse case ($\alpha_n = 1$), of the SAAC algorithm as described in
this paper (i.e. with a spectral embedding based on the adjacency matrix, not on the non-backtracking
matrix). The fraction of correct entries ($1 - \mathrm{Error}(\hat{Z}, Z)$) is displayed as a function of the parameter $s$.
The number of nodes is fixed to n = 1000 and the curves in different colors correspond to different 
values of a. For each value of s, the error is averaged over 200 networks drawn under the corresponding 
SBMO. Overall, the algorithm performs best with large values of a (that correspond to larger degrees). 
For each value of a, the case s = 1/2 corresponds to a standard SBM without overlap and we see that our 
algorithm performs well. As s decreases, we see that below a certain value of s the performance of the 
algorithm deteriorates greatly which is in accordance with the phase transition conjectured above. 



Figure 4: Error of SAAC as a function of the fraction of each type of pure nodes in a SBMO model 
with two-by-two overlap between K = 2 communities 